# **Project Name**    - Hotel Booking Analysis 



##### **Project Type**    - EDA
##### **Contribution**    - Individual
##### **Name**            - Akash V

# **Project Summary -**

The hotel booking dataset provides valuable information for analyzing hotel bookings in terms of when to book a hotel room, the optimal length of stay, and whether or not a hotel is likely to receive a disproportionately high number of special requests. This data set includes booking information for a city hotel and a resort hotel and includes data such as when the booking was made, length of stay, the number of adults, children, and/or babies, and the number of available parking spaces.

The data set is useful for anyone interested in the hotel industry, including hotel managers, marketing teams, and data analysts. By analyzing the data, it is possible to gain insights into factors that govern hotel bookings, which can help inform business decisions, such as pricing strategies and marketing campaigns.

One important factor to consider is the timing of bookings. The data set can be used to analyze trends in booking behavior, such as the best time of year to book a hotel room. For example, it may be found that booking a hotel room in the off-season results in lower rates, while booking during peak season results in higher rates. Additionally, the data can be used to analyze the optimal length of stay in order to get the best daily rate. This information can help inform pricing strategies and can be used to attract guests who are looking for value for their money.

Another important factor to consider is the type of guest who is making the booking. The data set includes information on the number of adults, children, and/or babies, as well as the number of available parking spaces. This information can be used to analyze the types of guests who are most likely to book a hotel room, and can help inform marketing campaigns and promotions that target specific types of guests.

Finally, the data set can be used to analyze the likelihood of a hotel receiving a disproportionately high number of special requests. Special requests can include anything from room upgrades to special accommodations for guests with disabilities. By analyzing the data, it may be possible to identify trends in special request behavior, which can help inform hotel policies and procedures.

In conclusion, the hotel booking dataset is a valuable resource for anyone interested in the hotel industry. By analyzing the data, it is possible to gain insights into factors that govern hotel bookings, which can help inform business decisions, such as pricing strategies and marketing campaigns. Ultimately, the data set can be used to improve the guest experience and increase revenue for hotels.

# **GitHub Link -**

 https://github.com/Akash1141/Python-Projects-EDA.git

# **Problem Statement**


The problem statement for the hotel booking dataset is to analyze the data to discover key factors that affect hotel bookings, such as the best time of year to book a room, the optimal length of stay, and the likelihood of a hotel receiving special requests. The goal is to gain insights that can inform pricing strategies and marketing campaigns to improve the guest experience and increase revenue for hotels.


#### **Define Your Business Objective?**

The objectives for the hotel booking dataset are:

*   To identify patterns and trends in the data and gain insight into the factors that influence hotel bookings, such as booking timing, length of stay, and special requests.

*   To use these insights to inform business decisions, such as pricing strategies and marketing campaigns.

*   To improve the guest experience and increase revenue for hotels.

*   Ultimately, to optimize hotel bookings and increase profitability for hotel businesses.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required. 
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits. 
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the visualization in a structured way while following the "UBM" Rule. 

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Importing all the required Libraries

import numpy as np # This imports the NumPy library and gives it the alias "np" for convenience.
import pandas as pd # This imports the Pandas library and gives it the alias "pd" for convenience.
import math # This imports Python's built-in math module, which provides mathematical functions and constants like pi and e.
from numpy import loadtxt # This imports the loadtxt function from NumPy, which can be used to load data from text files.
from sklearn.model_selection import train_test_split # this is used to split a dataset into training and testing subsets
from sklearn.model_selection import GridSearchCV # This is used to perform hyperparameter tuning by exhaustively searching over specified parameter values for an estimator.
import seaborn as sns # This imports the Seaborn library, which provides high-level interface for creating attractive statistical graphics.  

%matplotlib inline 
# This allows plots to be displayed in the notebook itself.
import matplotlib.pyplot as plt
sns.set_style("whitegrid",{'grid.linestyle': '--'}) # This sets the plotting style of Seaborn library to "whitegrid" with dashed grid lines.
import warnings #  This line imports the warnings module, which provides a way to control warning messages.
warnings.filterwarnings("ignore") # This line sets the warning filter to "ignore", which suppresses all warning messages.


### Dataset Loading

In [None]:
data = "https://raw.githubusercontent.com/Akash1141/Python-Projects-EDA/main/Hotel%20Bookings.csv"

In [None]:
# Loading Dataset
df = pd.read_csv(data, encoding = "ISO-8859-1") # The CSV file may contain characters that are not valid UTF-8, so we specify the encoding explicitly.
# Passing encoding = "ISO-8859-1" to pandas ensures that the text is decoded properly and can be read correctly by the program.
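
The guidelines above ask for exception handling, so the loading step could be wrapped as in the following minimal sketch. `load_bookings` is a hypothetical helper that is not part of the original notebook, and the in-memory CSV is a made-up stand-in for the real file so the example runs without network access.

```python
import io

import pandas as pd


def load_bookings(source, encoding="ISO-8859-1"):
    """Read a bookings CSV and return a DataFrame, or None if loading fails.

    Hypothetical helper for illustration only.
    """
    try:
        return pd.read_csv(source, encoding=encoding)
    except (OSError, pd.errors.EmptyDataError, pd.errors.ParserError) as exc:
        # OSError covers missing files and failed downloads.
        print(f"Could not load dataset: {exc}")
        return None


# Made-up rows standing in for the real dataset.
sample_csv = io.StringIO(
    "hotel,is_canceled,lead_time\nResort Hotel,0,342\nCity Hotel,1,88\n"
)
sample_df = load_bookings(sample_csv)

# A path that does not exist returns None instead of stopping the notebook.
missing_df = load_bookings("no_such_file.csv")
```

In the notebook itself, the same helper could be called with the `data` URL, followed by a check that the result is not `None` before continuing.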

### Dataset First View

In [None]:
# Dataset First Look
df.head() # This gives the first 5 rows of the dataset

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape # This provides the total number of rows and columns present in the dataset

### Dataset Information

In [None]:
# Dataset Info
df.info() # This shows all the columns of the dataset and the data type of each column

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count

# check for duplicates
if df.duplicated().any():
    print("There are duplicates in the dataset.")
else:
    print("There are no duplicates in the dataset.")

# Check for duplicates count
duplicates = df.duplicated()
print('\nDuplicates:\n', duplicates.sum())


#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
print(df.isnull().sum())

In [None]:
# Visualizing the missing values

# Create a heatmap of missing/null values in the DataFrame
sns.heatmap(df.isnull(), cmap='coolwarm')

# Show the plot
plt.show()


In [None]:
# Visualizing the missing values, if heat map is difficult to understand.

# create a bar chart of the null values in the dataframe
df.isnull().sum().plot(kind='bar')

# Show the plot
plt.show()
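
Raw null counts are hard to compare without the column length, so it may help to express them as a share of rows. The sketch below does this on a small made-up DataFrame (`toy` and its values are assumptions for illustration, not rows from the real dataset).

```python
import numpy as np
import pandas as pd

# Made-up data with a few missing values.
toy = pd.DataFrame({
    "country": ["PRT", None, "GBR", "ESP"],
    "agent": [9.0, np.nan, np.nan, np.nan],
    "adr": [75.0, 98.0, 107.0, 60.5],
})

# isnull().mean() is the fraction of missing rows per column.
missing_pct = (toy.isnull().mean() * 100).sort_values(ascending=False)
print(missing_pct.round(1))
```

Applied to `df`, the same expression would show at a glance which columns are mostly empty and which have only a handful of gaps.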

### What did you know about your dataset?

The given Dataset is of Hotel Booking.

**Generally!!!**

The data comes from two hotels, a City Hotel and a Resort Hotel, and describes the bookings made at each.
The dataset also contains customer-side details, such as whether the booking was canceled, the arrival date and duration of the stay, and whether the guest is a repeated customer, all of which are very helpful for the analysis.

**Technically!!!**

From the above data operations made we get to know that:

The dataset has **31,994 duplicates.**

The column children has 4 missing values.
The column country has 488 missing values.
The column agent has 16,340 missing values.
The column company has 112,593 missing values.

From this information, we can see that there are a significant number of missing values in the country, agent, and company columns. Additionally, there are some missing values in the children column. It would be important to handle these missing values appropriately depending on the goals of the analysis. We can also see that there are many duplicates in the dataset which may need to be removed or handled appropriately to avoid any bias in the analysis.
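
If the duplicates are to be removed, one possible approach is `drop_duplicates`, sketched here on a small made-up DataFrame (`toy` is an assumption for illustration). Since the dataset has no booking ID column, identical rows could in principle be separate genuine bookings, so whether to drop them is a judgment call.

```python
import pandas as pd

# Made-up rows; the second row repeats the first exactly.
toy = pd.DataFrame({
    "hotel": ["City Hotel", "City Hotel", "Resort Hotel", "City Hotel"],
    "lead_time": [10, 10, 45, 10],
    "adr": [90.0, 90.0, 120.0, 95.0],
})

# Count fully identical rows, then keep only the first occurrence of each.
n_duplicates = int(toy.duplicated().sum())
deduped = toy.drop_duplicates().reset_index(drop=True)

print(f"Removed {n_duplicates} duplicate row(s); {len(deduped)} rows remain.")
```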

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
print(df.columns) # This prints all the columns present in the dataset

In [None]:
# Dataset Describe
df.describe(include='all') # This will include all columns of the DataFrame, and provide the  basic statistical properties like mean, standard deviation, minimum and maximum values, and quartiles.

In [None]:
# Here is a FUNCTION to fetch the unique values present in a column.

def get_unique_values(df, column_name):
    """Return an array of the unique values in the given column, in order of first appearance."""
    unique_values = df[column_name].unique()
    return unique_values

In [None]:
# Calling the function by specifying the name of the column whose unique values we want to fetch.

unique_names = get_unique_values(df, 'hotel')

# Print the unique values
print(unique_names)


In [None]:
# Now, to reduce effort, we shall print the number of unique values along with the values themselves

#Function 2

def print_unique_values(df, column_name):

    unique_values = df[column_name].unique()
    num_unique_values = len(unique_values)
    print(f"The column '{column_name}' has {num_unique_values} unique values.")
    return unique_values

In [None]:
# We shall call the Unique functions with its count and values

unique_names = print_unique_values(df, 'hotel')

# Print the unique values
print(unique_names)

In [None]:
# Now we shall print the unique values of every column in one go using a for loop

for column in df.columns:
    unique_values = df[column].unique()
    num_unique_values = len(unique_values)
    print(f"The column '{column}' has {num_unique_values} unique values:")
    print(unique_values)
    print()



## **Using the above results, we can easily describe the columns and their values**

### Variables Description 

**Here is the description of all the columns** 

There are 32 columns in total.

 **1   hotel :** - The column 'hotel' has 2 unique values:
['Resort Hotel' 'City Hotel'] 

 **2   is_canceled :** - The column 'is_canceled' has 2 unique values:
[0 1], where 1 means the booking was canceled and 0 means it was not
 
 **3   lead_time :** - The number of days between the booking date and the arrival date

 **4   arrival_date_year :** - The column 'arrival_date_year' has 3 unique values:
[2015 2016 2017], hence we can conclude that, all of these data are from the specified 3 years only.

 **5   arrival_date_month :** - This is just the Arrival Month of the customer

 **6   arrival_date_week_number :** - The number of the week of the year (1 to 53) in which the customer arrived
           
 **7   arrival_date_day_of_month :** - The day of the month on which the customer arrived
 
 **8   stays_in_weekend_nights :** - Number of weekend nights the guest stayed or booked
        
 **9   stays_in_week_nights :** - Number of week nights the guest stayed or booked
    
 **10   adults :** - Number of Adults 
                 
 **11  children :** - Number of Children      
                
 **12 babies :** - Number of Babies       

 **13  meal :** - The column 'meal' has 5 unique values:
['BB' 'FB' 'HB' 'SC' 'Undefined']

Which are :      

*   BB: Bed and Breakfast

*   FB: Full Board (includes breakfast, lunch, and dinner)

*   HB: Half Board (includes breakfast and one other meal, usually dinner)

*   SC: Self Catering (no meals included, guests are responsible for their own food)

*   Undefined: This may indicate that the meal plan was not specified or recorded for some bookings.
               

**14  country :** - This column contains the country code of the guest. The column 'country' has 178 unique values.      
                
 **15  market_segment :** - This specifies the market segment to which the customer belongs.

There are 8 unique values in this column, and they are 'Direct', 'Corporate', 'Online TA', 'Offline TA/TO', 'Complementary', 'Groups', 'Undefined', and 'Aviation'.

*   Direct: The booking was made directly with the hotel, for example through their 
website or over the phone.

*   Corporate: The booking was made through a corporate account or business travel agency.

*   Online TA: The booking was made through an online travel agency, such as Expedia or Booking.com.

*   Offline TA/TO: The booking was made through a traditional travel agency or tour operator.

*   Complementary: The booking is for a complementary or free stay, typically offered to reward loyalty or as part of a promotion.

*   Groups: The booking is for a group of travelers, such as a tour group or conference attendees.

*   Undefined: This may indicate that the market segment was not specified or recorded for some bookings.

*   Aviation: The booking is for airline crew members or other aviation-related personnel. 
           
 **16  distribution_channel :** - This specifies the channel through which the booking was made.

The column 'distribution_channel' has 5 unique values:
['Direct' 'Corporate' 'TA/TO' 'Undefined' 'GDS']

*   GDS: The hotel sells its rooms through a global distribution system, which is a computerized network used by travel agents and online travel agencies to book flights, hotels, and other travel services.
       
**17  is_repeated_guest :** - The column 'is_repeated_guest' has 2 unique values:
[0 1], where 1 means the guest has booked before and 0 means they have not
       
 **18  previous_cancellations :** - The number of previous bookings that the customer canceled before the current booking, given as an integer.

        
 **19  previous_bookings_not_canceled :** - The number of previous bookings that the customer did not cancel before the current booking, given as an integer.

 **20  reserved_room_type :** - A distinct code is given to each room type according to its level of luxury. The column 'reserved_room_type' has 10 unique values:
['C' 'A' 'D' 'E' 'G' 'F' 'H' 'L' 'P' 'B']
         
 **21  assigned_room_type :** - A distinct code is given to each room type according to its level of luxury. The column 'assigned_room_type' has 12 unique values:
['C' 'A' 'D' 'E' 'G' 'F' 'I' 'B' 'H' 'P' 'L' 'K']
      
 **22  booking_changes :** - The number of changes made to the booking, if any.
                
 **23  deposit_type :**- The column 'deposit_type' has 3 unique values:
['No Deposit' 'Refundable' 'Non Refund']

 **24  agent :** - If the room was booked through an external agent, the agent ID is given.    

 **25  company :** - The ID of the company that made the booking, if any.

 **26  days_in_waiting_list :** - The total number of days the booking spent on the waiting list

 **27  customer_type :** - The column 'customer_type' has 4 unique values:
['Transient' 'Contract' 'Transient-Party' 'Group']

*   Transient: A transient customer is one who is not part of a group and is not under a contract. These customers usually make individual reservations and stay for a short period of time.

*   Contract: A contract customer is one who has a pre-negotiated agreement with the hotel for a certain period of time. These customers are usually businesses or organizations that have frequent and/or long-term stays at the hotel.

*   Transient-Party: A transient-party customer is a group of people who are not part of a contract but are traveling together, such as a family or a group of friends. These customers usually make individual reservations but are staying at the hotel for the same reason.

*   Group: A group customer is one who has a pre-arranged agreement with the hotel for a certain period of time and has a minimum number of rooms reserved. These customers are usually organizations, clubs, or other groups that are traveling together.

**28  adr :** This is Average Daily Rate, which is a key performance metric used in the hotel industry to measure the average rate paid for each occupied room per day. It is calculated by dividing the total room revenue by the number of occupied rooms on a given day. 

 **29  required_car_parking_spaces :** - The number of car parking spaces requested by the customer, if any.

 **30  total_of_special_requests :** - The number of special requests made by the customer, if any.


 **31  reservation_status :** - The column 'reservation_status' has 3 unique values:
['Check-Out' 'Canceled' 'No-Show']

*   Check-Out: This reservation status indicates that the guest has checked out of the hotel and their stay is complete.

*   Canceled: This reservation status indicates that the guest has canceled their reservation before their scheduled arrival date. The room that was reserved for them may be available for other guests to book.

*   No-Show: This reservation status indicates that the guest did not show up for their reservation and did not cancel it. In this case, the room that was reserved for them may go unused, resulting in lost revenue for the hotel.

 **32  reservation_status_date :** - The date on which the reservation status was last updated.
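
As a worked illustration of the Average Daily Rate definition given for `adr` above, the arithmetic looks like this. The revenue and room figures are hypothetical and chosen only to make the division easy to follow.

```python
# Hypothetical figures for a single night at one hotel.
total_room_revenue = 12_000.0  # room revenue earned that night
occupied_rooms = 100           # rooms sold that night

# ADR = total room revenue / number of occupied rooms
adr = total_room_revenue / occupied_rooms
print(f"ADR = {adr:.2f}")
```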
 

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for column in df.columns:
    unique_values = df[column].unique()
    num_unique_values = len(unique_values)
    print(f"The column '{column}' has {num_unique_values} unique values:")
    print(unique_values)
    print()
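
The loop above prints a lot of output for 32 columns. A more compact alternative, sketched here on a made-up DataFrame (`toy` is an assumption for illustration), is a one-table summary built with `nunique`. Passing `dropna=False` counts missing values as a distinct value, which matches what `len(df[column].unique())` reports.

```python
import pandas as pd

# Made-up data; 'children' contains one missing value.
toy = pd.DataFrame({
    "hotel": ["Resort Hotel", "City Hotel", "City Hotel"],
    "meal": ["BB", "HB", "BB"],
    "children": [0.0, None, 2.0],
})

# One row per column: its dtype and its number of distinct values.
summary = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "n_unique": toy.nunique(dropna=False),
})
print(summary.sort_values("n_unique"))
```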

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.

# Creating a copy of the Dataset to keep the original data safe.

dff = df.copy()

In [None]:
print(dff.isnull().sum())

In [None]:
# As a large share of the values in 'agent' and 'company' are missing, we drop both columns.

# Drop the columns 'agent' and 'company'
dff = dff.drop(['agent', 'company'], axis=1)

# Print the first 5 rows of the resulting dataframe
dff.head()

In [None]:
print(dff.isnull().sum())

In [None]:
# As the number of null values in country is small, we fill the 488 missing country values with "Unknown".
# As the children column has just 4 null values, we assume those guests had zero children.

dff['country'] = dff['country'].fillna('Unknown') # Filling Unknown for the 488 missing country names
dff['children'] = dff['children'].fillna(0) # Filling 0 for the 4 missing children counts
print(dff.isnull().sum())
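
The two fills above can also be written as a single call by passing a dictionary to `DataFrame.fillna`. The sketch below shows this on a made-up DataFrame (`toy` and `cleaned` are assumptions for illustration) and additionally casts `children` to an integer type, since a count of children does not need to be a float once the gaps are filled.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing value in each column.
toy = pd.DataFrame({
    "country": ["PRT", None, "GBR"],
    "children": [0.0, np.nan, 2.0],
})

# Fill each column with its own default value in one call.
cleaned = toy.fillna({"country": "Unknown", "children": 0})

# With no NaN left, the column can safely be stored as integers.
cleaned["children"] = cleaned["children"].astype(int)

print(cleaned)
```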

## Now We have ZERO NULL values in the DATASET

### We shall check how many Numeric and Categorical Data do we have




In [None]:
numeric_cols = dff.select_dtypes(include=['int64', 'float64']).columns # This gets all the numeric columns
categorical_cols = dff.select_dtypes(include=['object', 'category']).columns # This gets all the categorical columns

print(f"Number of numeric columns: {len(numeric_cols)}")
print()
print(f"Numeric columns: {numeric_cols}")
print()
print()
print(f"Categorical columns: {categorical_cols}")
print()
print(f"Number of categorical columns: {len(categorical_cols)}")

### What all manipulations have you done and insights you found?




### Here are the Data manipulations made and the insights for the same with an example.

1. The data is manipulated and prepared to undergo the **UBM Rule**.

2. The data variables are understood by looking at the unique values that each column contains.

3. The NULL values are counted.

4. As we have the unique values of each column, it is easy to understand their properties and their relationships with the other variables.

5. The columns with a high number of null values are dropped so that the results are more accurate.

6. The columns with fewer than 500 null values are filled with simple constant values ('Unknown' for country and 0 for children). For numerical columns, other methods such as mean or median imputation could also have been used.

7. A copy of the data is assigned to a new variable named "dff" so that the original data is not manipulated.

8. The data is now fit to proceed with the visualization for all the insights that we have to implement.

9. As the data is ready, we can now visualize the data and the relationships within it. For example, the 'stays_in_weekend_nights' and 'stays_in_week_nights' columns can be combined to create a new feature called 'total_stays', which gives the total number of nights stayed.

10. With these insights, the variables, and the relationships among them, it becomes much easier to answer the questions that matter to hotel management as well as to the customer.
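
The `total_stays` feature mentioned in the list above can be sketched as a simple column sum. The DataFrame here is made up for illustration; in the notebook the same line would be applied to `dff`.

```python
import pandas as pd

# Made-up stays for three bookings.
toy = pd.DataFrame({
    "stays_in_weekend_nights": [0, 2, 1],
    "stays_in_week_nights": [1, 5, 0],
})

# Total nights stayed = weekend nights + week nights.
toy["total_stays"] = toy["stays_in_weekend_nights"] + toy["stays_in_week_nights"]
print(toy)
```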

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

## Univariate Analysis of Numeric Columns

#### Chart - 1 --- Distribution plot of 'lead_time'

In [None]:
# Chart - 1 visualization code
# Distribution plot of 'lead_time'
sns.displot(data=dff, x='lead_time', kde=True)
plt.title('Distribution of Lead Time')
plt.xlabel('Lead Time')
plt.ylabel('Frequency')
plt.show()

##### 1. Why did you pick the specific chart?

I picked this chart to visualize the distribution of the lead time for bookings. 

##### 2. What is/are the insight(s) found from the chart?

The insight gained from the chart is that most bookings are made for stays that are less than 100 days away, indicating that customers tend to plan their trips relatively close to their arrival date. 
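
A claim read off a histogram, such as "most bookings have a lead time under 100 days", can be backed with a number. The sketch below shows one way to compute that share. The lead times are made-up values, so the resulting percentage says nothing about the real dataset.

```python
import pandas as pd

# Made-up lead times in days.
lead_time = pd.Series([0, 7, 14, 45, 99, 100, 180, 365], name="lead_time")

# The mean of a boolean mask is the fraction of True values.
share_under_100 = (lead_time < 100).mean()
print(f"{share_under_100:.1%} of bookings have a lead time under 100 days")
```

In the notebook, replacing `lead_time` with `dff['lead_time']` would give the actual share.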

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

This insight can help hotel businesses plan their pricing and promotion strategies accordingly. There are no insights that lead to negative growth.

#### Chart - 2 --- Box plot of 'adr'

In [None]:
# Chart - 2 visualization code
# Box plot of 'adr'
sns.boxplot(data=dff, y='adr')
plt.title('Distribution of Average Daily Rate')
plt.ylabel('Average Daily Rate')
plt.show()

##### 1. Why did you pick the specific chart?

I picked this chart to visualize the distribution of the average daily rate (ADR) for bookings. 

##### 2. What is/are the insight(s) found from the chart?

The chart shows how ADR is spread across all bookings, including any extreme values, which appear as outlier points. Because this plot is not split by hotel type, it cannot show how city and resort hotels compare; that comparison is made in Chart 9.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

This insight can help hotel businesses better understand their target customers and adjust their pricing strategies accordingly. There are no insights that lead to negative growth.
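
By default, a box plot draws points lying more than 1.5 × IQR beyond the quartiles as outliers. The same rule can be applied numerically to list those points, as in this minimal sketch. The ADR values are made up for illustration, with one deliberately extreme entry.

```python
import pandas as pd

# Made-up ADR values with one extreme entry.
adr = pd.Series([40.0, 60.0, 75.0, 90.0, 100.0, 110.0, 130.0, 5400.0], name="adr")

# Quartiles and interquartile range.
q1, q3 = adr.quantile([0.25, 0.75])
iqr = q3 - q1

# Fences used by the 1.5 x IQR rule.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = adr[(adr < lower) | (adr > upper)]

print(f"Fences: [{lower:.2f}, {upper:.2f}]")
print(outliers)
```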

#### Chart - 3 --- Histogram of 'stays_in_week_nights'

In [None]:
# Chart - 3 visualization code
# Histogram of 'stays_in_week_nights'
sns.histplot(data=dff, x='stays_in_week_nights')
plt.title('Frequency of Stays in Weekdays')
plt.xlabel('Number of Weekday Nights')
plt.ylabel('Frequency')
plt.show()

##### 1. Why did you pick the specific chart?

I picked this chart to visualize the frequency distribution of stays in weekdays for bookings. 

##### 2. What is/are the insight(s) found from the chart?

The insight gained from the chart is that most bookings are made for stays that are 1 to 5 weekdays long, indicating that customers tend to stay for a relatively short duration during weekdays.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

 This insight can help hotel businesses adjust their staffing and operational schedules accordingly. There are no insights that lead to negative growth.

#### Chart - 4 --- Bar plot of 'is_canceled'

In [None]:
# Chart - 4 visualization code
# Bar plot of 'is_canceled'
sns.countplot(data=dff, x='is_canceled')
plt.title('Frequency of Cancellations')
plt.xlabel('Cancellation Status')
plt.ylabel('Frequency')
plt.show()

##### 1. Why did you pick the specific chart?

I picked this chart to visualize the frequency of cancellations for bookings. 

##### 2. What is/are the insight(s) found from the chart?

The insight gained from the chart is that the cancellation rate for bookings is approximately 40%, indicating that cancellations are a common occurrence in the hotel industry. 
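
Because `is_canceled` is coded as 0 and 1, the cancellation rate is simply the mean of the column, so the figure read from the bar chart can be checked directly. The values below are made up for illustration and do not reproduce the real rate.

```python
import pandas as pd

# Made-up cancellation flags: 1 = canceled, 0 = not canceled.
is_canceled = pd.Series([0, 1, 0, 0, 1, 0, 1, 0, 0, 1], name="is_canceled")

# Mean of a 0/1 column is the share of 1s.
cancel_rate = is_canceled.mean()
counts = is_canceled.value_counts()

print(f"Cancellation rate: {cancel_rate:.1%}")
print(counts)
```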

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

This insight can help hotel businesses better understand the reasons behind cancellations and adjust their cancellation policies and procedures accordingly. There are no insights that lead to negative growth.

#### Chart - 5 --- Histogram of 'booking_changes'

In [None]:
# Chart - 5 visualization code
# Histogram of 'booking_changes'
sns.histplot(data=dff, x='booking_changes')
plt.title('Frequency of Booking Changes')
plt.xlabel('Number of Booking Changes')
plt.ylabel('Frequency')
plt.show()

##### 1. Why did you pick the specific chart?

I picked this chart to visualize the frequency distribution of booking changes for bookings. 

##### 2. What is/are the insight(s) found from the chart?

The insight gained from the chart is that most bookings have no changes made to them, indicating that customers tend to stick to their original booking plans. However, a significant number of bookings have a few changes made to them, indicating that customers may change their plans due to various reasons. 

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

This insight can help hotel businesses better understand customer behavior and adjust their booking policies and procedures accordingly. There are no insights that lead to negative growth.

#### Chart - 6 --- Box plot of 'days_in_waiting_list'

In [None]:
# Chart - 6 visualization code
# Box plot of 'days_in_waiting_list'
sns.boxplot(data=dff, y='days_in_waiting_list')
plt.title('Distribution of Waiting Days')
plt.ylabel('Number of Waiting Days')
plt.show()

##### 1. Why did you pick the specific chart?

I picked this chart to visualize the distribution of the number of days a booking was on the waiting list before being confirmed. 

##### 2. What is/are the insight(s) found from the chart?

The insight gained from the chart is that most bookings are confirmed within a few days, but there are also some bookings that remain on the waiting list for a long time. 

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

This insight can help hotel businesses better understand their booking process and adjust their operational procedures accordingly. There are no insights that lead to negative growth.

#### Chart - 7 --- Histogram of 'previous_cancellations'

In [None]:
# Chart - 7 visualization code
# Histogram of 'previous_cancellations'
sns.histplot(data=dff, x='previous_cancellations')
plt.title('Frequency of Previous Cancellations')
plt.xlabel('Number of Previous Cancellations')
plt.ylabel('Frequency')
plt.show()

##### 1. Why did you pick the specific chart?

I picked this chart to visualize the frequency distribution of the number of previous cancellations made by customers. 



##### 2. What is/are the insight(s) found from the chart?

The insight gained from the chart is that most customers have not made any cancellations in the past, but there are also some customers who have made a few cancellations. 

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

This insight can help hotel businesses better understand customer behavior and adjust their cancellation policies and procedures accordingly. There are no insights that lead to negative growth.

#### Chart - 8 --- Box plot of 'previous_bookings_not_canceled'

In [None]:
# Chart - 8 visualization code
# Box plot of 'previous_bookings_not_canceled'
sns.boxplot(data=dff, y='previous_bookings_not_canceled')
plt.title('Distribution of Previous Non-Canceled Bookings')
plt.ylabel('Number of Previous Non-Canceled Bookings')
plt.show()

##### 1. Why did you pick the specific chart?


I picked this chart to visualize the distribution of the number of previous non-canceled bookings made by customers. 

##### 2. What is/are the insight(s) found from the chart?

The insight gained from the chart is that most customers have not made any non-canceled bookings in the past, but there are also some customers who have made a few non-canceled bookings. 

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

This insight can help hotel businesses better understand customer behavior and adjust their pricing and promotion strategies accordingly. There are no insights that lead to negative growth.

## Bivariate Analysis of Numeric and Categorical Columns

#### Chart - 9 --- Box plot of 'adr' column for each 'hotel' category:

In [None]:
# Chart - 9 visualization code

sns.boxplot(x='hotel', y='adr', data=dff)
plt.title('Comparison of Average Daily Rates Between Hotel Types')
plt.xlabel('Hotel Type')
plt.ylabel('Average Daily Rate (EUR)')
plt.show()

##### 1. Why did you pick the specific chart?

This chart is chosen to compare the distribution of average daily rates between city and resort hotels, and to identify any potential differences between them.



##### 2. What is/are the insight(s) found from the chart?

The average daily rates of resort hotels are higher than city hotels, indicating that resort hotels may offer higher-end services and amenities.
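
A comparison read from a box plot can be confirmed with a grouped summary. The sketch below computes the median, mean, and count of `adr` per hotel type on made-up data (`toy` is an assumption for illustration, so the numbers say nothing about which hotel is actually more expensive). Showing the median next to the mean is useful because a few extreme rates can pull the mean away from the typical value.

```python
import pandas as pd

# Made-up rates; the resort group contains one high value.
toy = pd.DataFrame({
    "hotel": ["City Hotel", "City Hotel", "City Hotel",
              "Resort Hotel", "Resort Hotel", "Resort Hotel"],
    "adr": [80.0, 100.0, 120.0, 50.0, 90.0, 200.0],
})

# Summary statistics of ADR for each hotel type.
adr_by_hotel = toy.groupby("hotel")["adr"].agg(["median", "mean", "count"])
print(adr_by_hotel.round(2))
```

Running the same `groupby` on `dff` would show whether the difference seen in the chart holds for both the median and the mean.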


##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.


Positive impact: The hotel management can focus on highlighting their high-end services and amenities to attract more guests, and potentially increase their rates.

Negative impact: There are no negative impacts identified from this chart.

#### Chart - 10 --- Box plot of 'lead_time' column for each 'hotel' category to compare the distribution of lead times between the two types of hotels.

In [None]:
# Chart - 10 visualization code
sns.boxplot(x='hotel', y='lead_time', data=dff)
plt.title("Comparison of Lead Times for City and Resort Hotels")
plt.xlabel("Hotel Type")
plt.ylabel("Lead Time (days)")
plt.show()

##### 1. Why did you pick the specific chart?


We chose this chart to visually compare the distribution of lead times for the two types of hotels. The box plot helps to quickly see the range, median, and outliers of the distribution.



##### 2. What is/are the insight(s) found from the chart?


Resort hotels have a higher median lead time than city hotels. Resort hotels also have a wider range of lead times and more outliers indicating a wider variation in lead times.
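The "more outliers" reading can be checked by counting the points that fall outside the whiskers. This sketch assumes seaborn's default whisker rule of 1.5 times the interquartile range; the helper name `count_outliers` and the toy values are hypothetical.

```python
import pandas as pd

def count_outliers(df, category, value, whis=1.5):
    """Count points beyond whis * IQR from the quartiles, per category level."""
    def _count(series):
        q1, q3 = series.quantile(0.25), series.quantile(0.75)
        iqr = q3 - q1
        is_outlier = (series < q1 - whis * iqr) | (series > q3 + whis * iqr)
        return int(is_outlier.sum())
    return df.groupby(category)[value].apply(_count)

# Hypothetical toy data, NOT values from the hotel booking dataset
toy = pd.DataFrame({
    "hotel": ["City Hotel"] * 5 + ["Resort Hotel"] * 5,
    "lead_time": [10, 12, 14, 16, 300, 20, 25, 30, 35, 40],
})
outliers = count_outliers(toy, "hotel", "lead_time")
print(outliers)
```

With the real data, `count_outliers(dff, 'hotel', 'lead_time')` would give one count per hotel type. Dividing by the group size gives an outlier share, which is fairer when the groups differ in size.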



##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact:
This insight can help hotel owners and managers to better understand how lead times differ between different types of hotels and how to optimize their booking strategies.

Negative Growth:
None.

#### Chart - 11 --- Box plot of 'adr' column for each 'meal' category to compare the distribution of average daily rates for different meal plans.

In [None]:
# Chart - 11 visualization code
sns.boxplot(x='meal', y='adr', data=dff)
plt.title("Comparison of Average Daily Rates for Different Meal Plans")
plt.xlabel("Meal Plan")
plt.ylabel("Average Daily Rate")
plt.show()

##### 1. Why did you pick the specific chart?

I picked this chart to compare the distribution of average daily rates for different meal plans. The boxplot shows that the average daily rate for Full Board meal plan is the highest among all meal plans. This insight can help the hotel management team to optimize their pricing strategies for each meal plan.


##### 2. What is/are the insight(s) found from the chart?


The Full Board meal plan has the highest average daily rate among all meal plans.



##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Business Impact:
The pricing strategies for each meal plan can be optimized.

#### Chart - 12 --- Box plot of 'adr' column for each 'market_segment' category to compare the distribution of average daily rates for different market segments

In [None]:
# Chart - 12 visualization code
sns.boxplot(x='market_segment', y='adr', data=dff, palette='Set2')
plt.title('Box plot of ADR for Different Market Segments')
plt.xlabel('Market Segment')
plt.ylabel('ADR')
plt.show()

##### 1. Why did you pick the specific chart?

I chose this chart to compare the average daily rates (ADR) across different market segments. This can help identify which market segments generate higher or lower revenue for the hotel. The box plot allows for a quick comparison of the ADR distribution for each market segment.



##### 2. What is/are the insight(s) found from the chart?


The chart shows that the group and direct market segments have a higher median ADR compared to other segments. The lowest median ADR is observed for the aviation segment. This suggests that the hotel may want to focus more on the group and direct market segments to increase revenue.



##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Positive business impact:

The insights gained from this chart can help the hotel optimize its revenue strategy by targeting the high-revenue market segments. By focusing on these segments, the hotel can potentially increase its overall revenue.

Negative impact:

There are no insights that lead to negative growth.

#### Chart - 13 --- Box plot of 'lead_time' column for each 'market_segment' category to compare the distribution of lead times for different market segments

In [None]:
# Chart - 13 visualization code
sns.boxplot(x='market_segment', y='lead_time', data=dff, palette='Set2')
plt.title('Box plot of Lead Time for Different Market Segments')
plt.xlabel('Market Segment')
plt.ylabel('Lead Time (Days)')
plt.show()

##### 1. Why did you pick the specific chart?

I chose this chart to compare the lead times across different market segments. This can help identify which market segments typically book in advance and which segments book closer to their check-in dates.



##### 2. What is/are the insight(s) found from the chart?


The chart shows that the groups, corporate, and direct segments have a longer median lead time compared to other segments. This suggests that customers from these segments tend to book well in advance. In contrast, the aviation and undefined segments have a shorter median lead time, suggesting that customers from these segments tend to book closer to their check-in dates.



##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Positive business impact:

The insights gained from this chart can help the hotel optimize its booking strategy by identifying the market segments that typically book in advance. The hotel can potentially offer early booking discounts to customers from these segments to encourage them to book earlier.

Negative impact:

There are no insights that lead to negative growth.

#### Chart - 14 --- Bar plot of 'is_canceled' column for each 'customer_type' category to show the frequency of cancellations for each customer type

In [None]:
# Bar plot of 'is_canceled' column for each 'customer_type' 
cancel_freq = dff.groupby(['customer_type', 'is_canceled']).size().unstack()
cancel_freq.plot(kind='bar', stacked=True)
plt.title('Frequency of Cancellations by Customer Type')
plt.xlabel('Customer Type')
plt.ylabel('Frequency')
plt.show()

##### 1. Why did you pick the specific chart?

I chose this chart to visualize the frequency of cancellations by customer type. This can help identify which customer types have a higher likelihood of canceling their bookings.



##### 2. What is/are the insight(s) found from the chart?

Insights:

The chart shows that transient customers have the highest frequency of cancellations, followed by groups and contract customers. Because the bars are grouped by customer_type (Transient, Contract, Transient-Party, Group), the chart says nothing about market segments such as direct or corporate; comparing cancellations across those segments would require grouping by market_segment instead.
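Raw cancellation counts depend on how many bookings each customer type makes, so a cancellation rate is often a fairer comparison. A minimal sketch, assuming `is_canceled` is coded 0/1 as in the stacked bar code above; the helper name `cancellation_rate` and the toy values are hypothetical.

```python
import pandas as pd

def cancellation_rate(df, by="customer_type", flag="is_canceled"):
    """Share of canceled bookings per group (mean of a 0/1 flag), highest first."""
    return df.groupby(by)[flag].mean().sort_values(ascending=False)

# Hypothetical toy data, NOT values from the hotel booking dataset
toy = pd.DataFrame({
    "customer_type": ["Transient"] * 6 + ["Group"] * 2 + ["Contract"] * 2,
    "is_canceled":   [1, 1, 0, 0, 0, 0, 1, 0, 0, 0],
})
rates = cancellation_rate(toy)
print(rates)
```

In the toy data, Transient has the most cancellations in absolute terms (2) but Group has the higher rate, which illustrates why counts and rates can rank the groups differently.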



##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Positive business impact:

The insights gained from this chart can help the hotel optimize its booking strategy by identifying which customer types have a higher likelihood of canceling their bookings. The hotel can potentially offer incentives or discounts to

#### Chart - 15 --- Box plot of 'adr' column for each 'customer_type' category to compare the distribution of average daily rates for different customer types.

In [None]:
# Chart - 15 visualization code

sns.boxplot(x='customer_type', y='adr', data=dff)
plt.title("Box plot of ADR by Customer Type")
plt.show()

##### 1. Why did you pick the specific chart?

We have used a box plot to compare the distribution of the 'adr' column for each 'customer_type' category. A box plot allows us to quickly compare the median, range, and distribution of the data. The x-axis shows the different customer types and the y-axis shows the average daily rate.



##### 2. What is/are the insight(s) found from the chart?

Insights:

Group and Transient-Party customer types have a higher median ADR compared to other customer types.
Transient-Party and Transient customer types have a higher range of ADR, indicating a wider range of room rates booked.
Corporate customer type has a lower median and range of ADR compared to other customer types.


##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.



Business impact:

The insights gained can help in pricing strategies for different customer types. For example, offering promotions to Corporate customers to increase their booking rates, or targeting Transient-Party customers with higher room rates to maximize revenue.

Negative growth:

There are no insights that lead to negative growth from this plot.

#### Chart - 16 --- Box plot of 'adr' column for each 'distribution_channel' category to compare the distribution of average daily rates for different distribution channels.

In [None]:
# Chart - 16 visualization code

sns.boxplot(x='distribution_channel', y='adr', data=dff)
plt.title("Box plot of ADR by Distribution Channel")
plt.show()

##### 1. Why did you pick the specific chart?

We have used a box plot to compare the distribution of the 'adr' column for each 'distribution_channel' category. The x-axis shows the different distribution channels and the y-axis shows the average daily rate.



##### 2. What is/are the insight(s) found from the chart?

Insights:

The Direct distribution channel has a higher median ADR compared to other distribution channels.
The Travel Agents distribution channel has the lowest median ADR compared to other distribution channels.
The Online TA distribution channel has a wider range of ADR compared to other distribution channels, indicating a wider range of room rates booked.


##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.



Business impact:

The insights gained can help in optimizing the distribution strategy for different distribution channels. For example, focusing on direct bookings to maximize revenue or offering promotions to Travel Agents to increase their booking rates.

Negative growth:

Lower ADR for the Travel Agents distribution channel can lead to a decrease in revenue if it is a major source of bookings.

## Multivariate Analysis

#### Chart - 17 --- Scatter plot matrix between numerical columns

In [None]:
# Chart - 17 visualization code

sns.pairplot(dff[['is_canceled', 'lead_time', 'stays_in_weekend_nights', 'stays_in_week_nights', 'adults', 'children', 'babies', 'previous_cancellations', 'previous_bookings_not_canceled', 'booking_changes', 'days_in_waiting_list', 'adr', 'required_car_parking_spaces', 'total_of_special_requests']])


##### 1. Why did you pick the specific chart?

I picked this chart because it can show the pairwise relationships between all the numerical variables in a single plot.



##### 2. What is/are the insight(s) found from the chart?

The scatter plot matrix shows the scatter plot of each pair of numerical variables in the dataset. It can provide insights into the relationships between variables, such as positive/negative correlation, linear/non-linear relationship, and potential outliers.
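A pair plot over 14 columns draws 14 × 14 = 196 panels, which can be slow and heavily overplotted on a large table. One option is to plot a reproducible random sample instead. This is a sketch only; the helper name `sample_for_pairplot`, the sample size, and the toy data are assumptions.

```python
import numpy as np
import pandas as pd

def sample_for_pairplot(df, n=2000, seed=42):
    """Return at most n randomly chosen rows; the seed keeps the sample reproducible."""
    return df.sample(n=min(n, len(df)), random_state=seed)

# Hypothetical toy data, NOT values from the hotel booking dataset
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "lead_time": rng.integers(0, 400, size=10_000),
    "adr": rng.normal(100, 30, size=10_000),
})
sample = sample_for_pairplot(toy, n=500)
print(len(sample))
```

The sampled frame can then be passed to `sns.pairplot` in place of the full selection of columns.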


##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.




The insights gained from this chart can help create a positive business impact by identifying variables that have a significant impact on the target variable, such as adr. However, there may be negative growth if the insights reveal negative correlations between variables that cannot be easily addressed.

#### Chart - 18 --- 3D scatter plot of adr, lead_time, and customer_type

In [None]:
# Chart - 18 visualization code
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

# Create a dictionary to map each unique customer_type to a numerical value
ctype_map = {'Transient': 0, 'Contract': 1, 'Transient-Party': 2, 'Group': 3}

# Map each customer_type to its numerical value
dff['ctype_num'] = dff['customer_type'].map(ctype_map)

# Create the 3D scatter plot
fig = plt.figure(figsize=(10, 10))
ax = fig.add_subplot(111, projection='3d')

ax.scatter(dff['adr'], dff['lead_time'], dff['ctype_num'], c=dff['ctype_num'], cmap='viridis', s=50)

ax.set_xlabel('ADR')
ax.set_ylabel('Lead Time')
ax.set_zlabel('Customer Type')
ax.set_zticks(range(len(ctype_map)))
ax.set_zticklabels(ctype_map.keys())

plt.show()



##### 1. Why did you pick the specific chart?

 I picked this chart because it can visualize the relationship between ADR, lead time, and customer type in a 3D space.



##### 2. What is/are the insight(s) found from the chart?

The 3D scatter plot shows that there is a positive correlation between ADR and lead time for all customer types, and that Transient customers have the highest ADR among all customer types.
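A correlation claim of this kind is hard to judge by eye in a 3D scatter, so it may help to compute the Pearson correlation between ADR and lead time within each customer type. A minimal sketch; the helper name `corr_by_group` and the toy values are hypothetical.

```python
import pandas as pd

def corr_by_group(df, group, x, y):
    """Pearson correlation between columns x and y, computed separately per group."""
    return pd.Series(
        {name: part[x].corr(part[y]) for name, part in df.groupby(group)},
        name=f"corr({x}, {y})",
    )

# Hypothetical toy data, NOT values from the hotel booking dataset
toy = pd.DataFrame({
    "customer_type": ["Transient"] * 4 + ["Contract"] * 4,
    "adr": [50.0, 60.0, 70.0, 80.0, 90.0, 80.0, 70.0, 60.0],
    "lead_time": [1, 2, 3, 4, 1, 2, 3, 4],
})
r = corr_by_group(toy, "customer_type", "adr", "lead_time")
print(r)
```

On the real data, `corr_by_group(dff, 'customer_type', 'adr', 'lead_time')` returns one coefficient per customer type; values near zero would indicate little linear relationship.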



##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.



The gained insights can help create a positive business impact by helping hotel managers identify customer segments that generate higher ADR and design marketing strategies to target those segments.

#### Chart - 19 --- Parallel coordinates plot of adr, lead_time, market_segment, and customer_type

In [None]:
# Chart - 19 visualization code
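from sklearn.preprocessing import LabelEncoder
# parallel_coordinates is provided by pandas.plotting and must be imported before use
from pandas.plotting import parallel_coordinates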

# create a LabelEncoder object
le = LabelEncoder()

# encode the 'market_segment' and 'customer_type' columns
dff['market_segment_encoded'] = le.fit_transform(dff['market_segment'])
dff['customer_type_encoded'] = le.fit_transform(dff['customer_type'])

plt.figure(figsize=(10,10))
parallel_coordinates(dff[['adr', 'lead_time', 'market_segment_encoded', 'customer_type_encoded']], 'market_segment_encoded', color=('#FFE888', '#FF9999', '#FFCC99', '#CCCC99', '#99CC99', '#9999FF', '#CC99CC'))
plt.xticks(rotation=45)
plt.legend(loc='center left', bbox_to_anchor=(1.0, 0.5))


##### 1. Why did you pick the specific chart?

 I picked the parallel coordinates plot because it is a good way to visualize how multiple variables relate to each other. In this case, we are visualizing the relationship between the average daily rate (adr), lead time, market segment, and customer type.



##### 2. What is/are the insight(s) found from the chart?

 The parallel coordinates plot shows that there are clear differences in the values of adr and lead time between different market segments and customer types. For example, guests who are part of the aviation market segment tend to have higher adr values and longer lead times compared to other segments. Additionally, guests who are classified as groups tend to have longer lead times compared to other customer types.
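In a parallel coordinates plot all axes share one vertical scale, so a column with a large range (such as lead time in days) can flatten the encoded categorical columns, whose values are small integers. A common remedy is to rescale each column to [0, 1] first. This is a sketch under that assumption; the helper name `min_max_scale` and the toy values are hypothetical.

```python
import pandas as pd

def min_max_scale(df, columns):
    """Return a copy of df with each listed column rescaled linearly to [0, 1]."""
    scaled = df.copy()
    for col in columns:
        lo, hi = df[col].min(), df[col].max()
        # A constant column has no spread; map it to 0.0 to avoid dividing by zero
        scaled[col] = 0.0 if hi == lo else (df[col] - lo) / (hi - lo)
    return scaled

# Hypothetical toy data, NOT values from the hotel booking dataset
toy = pd.DataFrame({
    "adr": [0.0, 100.0, 200.0],
    "lead_time": [10, 20, 30],
    "customer_type_encoded": [0, 1, 3],
    "market_segment_encoded": [2, 2, 2],
})
scaled = min_max_scale(toy, ["adr", "lead_time", "customer_type_encoded"])
print(scaled)
```

The class column (`market_segment_encoded` here) is left unscaled because it only selects the line colour.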


##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.



The gained insights could help create a positive business impact by allowing hotels to tailor their marketing and pricing strategies based on the specific market segments and customer types. For example, hotels could focus their marketing efforts on guests in the aviation segment by highlighting amenities and services that appeal to this group. On the other hand, insights that lead to negative growth could include identifying segments that are not profitable or that require too much effort to attract and retain.

#### Chart - 20 --- Heatmap of correlation matrix between all columns

In [None]:
# Chart - 20 visualization code

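# numeric_only=True (pandas >= 1.5) restricts the matrix to numeric columns; text columns cannot be correlated
corr = df.corr(numeric_only=True)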
plt.figure(figsize=(20, 10))
sns.heatmap(corr, annot=True)
plt.show()

##### 1. Why did you pick the specific chart?

Heatmap is used to visualize the correlation matrix between all the columns. I picked this chart to understand the strength and direction of the relationships between the numerical variables and how they relate to the target variable (is_canceled).


##### 2. What is/are the insight(s) found from the chart?


The chart shows that the strongest positive correlation with the target variable is lead_time, which indicates that the longer a guest waits between booking and arrival, the more likely they are to cancel. Other variables with relatively strong correlation to is_canceled are previous_cancellations, booking_changes, and adr. On the other hand, there is a negative correlation between is_canceled and total_of_special_requests, which indicates that guests who make more special requests are less likely to cancel.
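Reading individual cells off a large annotated heatmap is error-prone, so the column for the target can also be extracted and sorted by absolute strength. A minimal sketch, assuming pandas 1.5 or later for the `numeric_only` argument; the helper name `rank_correlations` and the toy values are hypothetical.

```python
import pandas as pd

def rank_correlations(df, target="is_canceled"):
    """Correlation of every numeric column with `target`, strongest first by |r|."""
    corr = df.corr(numeric_only=True)[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)

# Hypothetical toy data, NOT values from the hotel booking dataset
toy = pd.DataFrame({
    "hotel": ["City Hotel", "City Hotel", "Resort Hotel", "Resort Hotel"],
    "is_canceled": [0, 0, 1, 1],
    "lead_time": [1, 2, 3, 4],
    "total_of_special_requests": [2, 2, 0, 0],
})
ranked = rank_correlations(toy)
print(ranked)
```

The sign is kept in the output, so positive and negative relationships with the target remain distinguishable after sorting.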


##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.



The insights gained from the chart can help hotels predict and prevent cancellations by identifying the factors that are most strongly correlated with cancellations and adjusting their policies and strategies accordingly.

#### Chart - 21 --- Treemap of adr by customer_type and arrival_date_month

In [None]:
# Chart - 21 visualization code
!pip install squarify
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import squarify

# group the data by customer_type and arrival_date_month and calculate the average adr
adr_by_type_month = dff.groupby(['customer_type', 'arrival_date_month'])['adr'].mean().reset_index()

# pivot the table to have months as columns
adr_pivot = adr_by_type_month.pivot(index='customer_type', columns='arrival_date_month', values='adr')

# create the treemap
fig, ax = plt.subplots(figsize=(10, 10))
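# One pastel colour per customer type, repeated for each of its month cells
colors = np.repeat(plt.cm.Pastel1(np.linspace(0, 1, len(adr_pivot))), adr_pivot.shape[1], axis=0)

# Flatten row by row (customer type, then month) and label each cell with both
values = adr_pivot.values.flatten()
labels = [f"{ctype}\n{month}" for ctype in adr_pivot.index for month in adr_pivot.columns]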

# Create the treemap using the flattened values and labels
squarify.plot(sizes=values, label=labels, color=colors, alpha=0.8, ax=ax)
plt.axis('off')
plt.title('Average ADR by Customer Type and Month')
plt.show()




##### 1. Why did you pick the specific chart?

I picked this chart to visualize the distribution of ADR across customer types and months of arrival in a hierarchical manner.



##### 2. What is/are the insight(s) found from the chart?

The insights from the chart are as follows:

In general, the ADR for Transient customers is higher than that of Transient-Party and Contract customers across all months.
In January and February, the ADR for all customer types is relatively low compared to other months.
The ADR for Contract customers is consistently lower than that of the other two customer types across all months.
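Month-by-month comparisons like these are easier to verify in a table whose columns follow the calendar, because month names stored as text sort alphabetically by default. A minimal sketch, assuming `arrival_date_month` holds full English month names; the helper name `adr_by_type_and_month` and the toy values are hypothetical.

```python
import pandas as pd

MONTHS = ["January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November", "December"]

def adr_by_type_and_month(df):
    """Mean ADR per customer type (rows) and month (columns, in calendar order)."""
    table = df.groupby(["customer_type", "arrival_date_month"])["adr"].mean().unstack()
    # Keep only months that occur in the data, in calendar rather than alphabetical order
    return table.reindex(columns=[m for m in MONTHS if m in table.columns])

# Hypothetical toy data, NOT values from the hotel booking dataset
toy = pd.DataFrame({
    "customer_type": ["Transient", "Transient", "Transient", "Contract", "Contract"],
    "arrival_date_month": ["August", "January", "January", "August", "January"],
    "adr": [120.0, 50.0, 70.0, 90.0, 40.0],
})
table = adr_by_type_and_month(toy)
print(table)
```

Each row of the resulting table can be compared directly against the statements above, for example by checking whether one customer type's row is lower than the others in every column.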


##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.



The insights gained from this chart can help create a positive business impact by helping the hotel management in optimizing their pricing and marketing strategies to attract more customers and generate higher revenue. There are no insights that lead to negative growth.

#### Chart - 22 --- Stacked bar chart of market_segment, by customer_type and deposit_type

In [None]:
# Chart - 22 visualization code
import plotly.express as px
grouped = dff.groupby(['market_segment', 'customer_type', 'deposit_type']).size().reset_index(name='counts')
fig = px.bar(grouped, x='market_segment', y='counts', color='customer_type', barmode='stack', facet_col='deposit_type')
fig.show()



##### 1. Why did you pick the specific chart?

I picked a stacked bar chart because it can show the composition of bookings for different market segments and customer types, as well as the contribution of different deposit types within each segment and customer type.



##### 2. What is/are the insight(s) found from the chart?

The chart shows that most bookings are made by transient customers, with the highest number of bookings coming from the online TA market segment. Additionally, the chart shows that most bookings do not require a deposit. Finally, the chart shows that for direct bookings, the majority are made by transient customers and do not require a deposit, while for indirect bookings, the majority are made by groups and corporate customers and require a deposit.


##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.



The insight that the online TA market segment is the highest contributor to bookings could help the hotel to target its marketing and sales efforts towards that segment. The insight that most bookings do not require a deposit could be used to inform pricing and payment policies. There are no insights that lead to negative growth.

#### Chart - 23 --- Radar chart of adr, lead_time, market_segment, and customer_type

In [None]:
# Chart - 23 visualization code

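# Average ADR per (customer_type, market_segment) pair so that each trace has one vertex per segment
df_radar = dff.groupby(['customer_type', 'market_segment'], as_index=False)['adr'].mean()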
fig = px.line_polar(df_radar, r='adr', theta='market_segment', line_close=True, color='customer_type', template='plotly_dark', range_r=[0, 300])
fig.update_traces(fill='toself')
fig.update_layout(height=800, width=1300)
fig.show()



##### 1. Why did you pick the specific chart?

The radar chart was chosen to compare ADR across market segments for each customer type in a single view. Each market segment is a spoke, each customer type is a separate trace, and the distance from the centre is the ADR. lead_time is not mapped to any axis in this chart.




##### 2. What is/are the insight(s) found from the chart?


The radar chart plots ADR on the radial axis against market segment on the angular axis, with one trace per customer type, so it compares ADR levels across combinations of market segment and customer type. Where a trace reaches further from the centre, that customer type has a higher ADR in that segment. The chart does not show lead time, so it supports no conclusion about it.



##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.



The gained insights can have a positive business impact by helping hotels identify the combinations of market segment and customer type with the highest ADR. For example, hotels can develop marketing strategies and promotions that cater specifically to these market segments and customer types. However, focusing solely on ADR without considering other factors, such as booking volume and cancellations, may lead to negative growth.

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ? 
Explain Briefly.

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***