# **Project Name**    -



##### **Project Type**    - EDA
##### **Contribution**    - Individual
##### **Student Name**    - Monali Vijay Mhaske

# **Project Summary -**

Write the summary here within 500-600 words.

# **GitHub Link -**

https://github.com/MonaliM5/Hotel-Booking-EDA-Project

# **Problem Statement**


**Write Problem Statement Here.**

#### **Define Your Business Objective?**

Answer Here.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure each chart answers the following format.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries

# Libraries for Data Handling
import numpy as np
import pandas as pd

#Libraries for visualization
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib as mpl
from matplotlib import cm
%matplotlib inline

# Suppressing Warnings
import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
# Mounting Google Drive, where the dataset is saved.

from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Load Dataset
df = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/Module 2.5 - Project EDA/Hotel Bookings.csv")





### Dataset First View

In [None]:
# Dataset First Look
# Going through the first 5 records of the dataset
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print("Number of rows in dataset    :", df.shape[0])
print("Number of columns in dataset :", df.shape[1])

### Dataset Information

In [None]:
# Dataset Info
# Checking the datatype and other information of all the features of dataset.
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
df.duplicated().sum()

In [None]:
#Dropping the duplicate records
df.drop_duplicates(inplace = True)

In [None]:
# Shape of dataset after dropping duplicate values
df.shape

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
# Checking missing values count and percentage
missing_df = df.isnull().sum().reset_index()
missing_df.columns = ['Column', 'Missing_Values']
missing_df['Missing_%'] = 100 * missing_df['Missing_Values'] / len(df)
missing_df = missing_df[missing_df['Missing_Values'] > 0].sort_values(by='Missing_%', ascending=False)

# Displaying the result
print(missing_df)


* I examined the dataset for missing values and found that a few columns such as agent, company, children, and country contain null entries.
* The agent and company columns have the highest proportion of missing data.
* To understand the impact, I visualized the percentage of missing values in each column using a horizontal bar chart.

In [None]:
# Visualizing the missing or null values

# Setting Figure size
plt.figure(figsize=(10, 6))

# Code to plot the data
sns.barplot(data=missing_df, x='Missing_%', y='Column', palette='viridis')

# Setting labels and title
plt.title('Missing Values by Column')
plt.xlabel('Percentage of Missing Values')
plt.ylabel('Column Name')
plt.tight_layout()
plt.show()

I identified missing values in four columns: children, agent, company, and country. Based on the nature of the data and domain understanding:

1. children (Numerical)
  * **What it represents**:
      Number of children in a booking

  * **Missing likely due to**:
      Data entry issues or oversight

  * **Imputation Strategy**:
      Replacing missing values with the median (since it's numerical and may be skewed).

2. company (Categorical but stored as numeric ID)

  * **What it represents**:
      ID of the company that made the booking

  * **Missing likely means**:
      No company involved in booking
      
  * **Imputation Strategy**:
      Replacing missing values with 0 and treating it as “No Company”.

3. country (Categorical)
  * **What it represents**:
      Country of the guest

  * **Missing likely due to**:
      Unrecorded guest info or country of guest not listed

  * **Imputation Strategy**:
      Replacing missing values with 'Unknown'.

4. agent (Categorical but stored as numeric ID)
  * **What it represents**:
      ID of the travel agent who booked the reservation

  * **Missing likely means**:
      Bookings were made without an agent

  * **Imputation Strategy**:
      Replacing missing values with 0 and treating it as “No Agent”.

In [None]:
# Replacing the null values with appropriate values
df.fillna({'children': df['children'].median()}, inplace=True)
df.fillna({'agent' :0}, inplace = True)
df.fillna({'company' :0}, inplace = True)
df.fillna({'country' :'Unknown'}, inplace = True)


# Checking the replacement of null values has worked properly or not
print("Total null values in dataset now :\n")
df.isna().sum().reset_index().rename(columns = {'index':'Columns', 0: 'Null Value Count'})

### What did you know about your dataset?

The dataset contains detailed booking information from two types of hotels: City Hotel and Resort Hotel. It consists of over 30 features describing the characteristics of each booking, including booking dates, stay duration, number of guests, pricing, customer demographics, and reservation status.

Here's what I've learned about the dataset:

📌 Nature of Data: The dataset is structured and tabular, with each row representing a single hotel booking.

🏨 Hotel Types: It includes two types of properties — City Hotel and Resort Hotel, allowing for comparative analysis.

📅 Time-Based Features: It has multiple date-related columns such as arrival_date_year, arrival_date_month, and lead_time, enabling time-series and seasonal trend analysis.

👨‍👩‍👧‍👦 Guest Details: The dataset captures the number of adults, children, and babies, which helps in understanding guest composition.

📉 Cancellations: The binary is_canceled column indicates whether a booking was canceled, which is a key target for business insights.

💰 Revenue Info: adr (Average Daily Rate) provides pricing information for each booking.

🌍 Customer & Market Segments: Columns like country, customer_type, market_segment, and distribution_channel help analyze customer origins and booking sources.

🚫 Cancellation History: Features such as is_repeated_guest, previous_cancellations, and days_in_waiting_list provide insights into booking reliability and loyalty.

In summary, this dataset offers a comprehensive view of hotel booking behavior and provides a strong foundation for exploring patterns, trends, and potential business opportunities.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe()

### Variables Description


***1. hotel*** : *Hotel(Resort Hotel or City Hotel)*

***2. is_canceled*** : *Value indicating if the booking was canceled (1) or not (0)*

***3. lead_time*** : *Number of days that elapsed between the entering date of the booking into the PMS(database) and the arrival date*

***4. arrival_date_year*** : *Year of arrival*

***5. arrival_date_month*** : *Month of arrival*

***6. arrival_date_week_number*** : *At which Week guests arrived or going to arrive*

***7. arrival_date_day_of_month*** : *Day of arrival*

***8. stays_in_weekend_nights*** : *Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel*

***9. stays_in_week_nights*** : *Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel*

***10. adults*** : *Number of adults*

***11. children*** : *Number of children*

***12. babies*** : *Number of babies*

***13. meal*** : *Type of meal guests ordered. Categories are presented in standard hospitality meal packages:*

***14. country*** : *Country which guests belong to*

***15. market_segment*** : *Market segment designation. In categories, the term “TA” means “Travel Agents” and “TO” means “Tour Operators”*

***16. distribution_channel*** : *Booking distribution channel. The term “TA” means “Travel Agents” and “TO” means “Tour Operators”*

***17. is_repeated_guest*** : *Value indicating if the booking name was from a repeated guest (1) or not (0)*

***18. previous_cancellations*** : *Number of previous bookings that were cancelled by the customer prior to the current booking*

***19. previous_bookings_not_canceled*** : *Number of previous bookings not cancelled by the customer prior to the current booking*

***20. reserved_room_type*** : *Code of room type reserved. Code is presented instead of designation for anonymity reasons.*

***21. assigned_room_type*** : *Code for the type of room assigned to the booking.*

***22. booking_changes*** : *Number of changes/amendments made to the booking from the moment the booking was entered on the PMS until the moment of check-in or cancellation*

***23. deposit_type*** : *Indication on if the customer made a deposit to guarantee the booking.*

***24. agent*** : *ID of the travel agency that made the booking*

***25. company*** : *ID of the company/entity that made the booking or responsible for paying the booking.*

***26. days_in_waiting_list*** : *Number of days the booking was in the waiting list before it was confirmed to the customer*

***27. customer_type*** : *Type of booking, assuming one of four categories*


***28. adr*** : *Average Daily Rate as defined by dividing the sum of all lodging transactions by the total number of staying nights*

***29. required_car_parking_spaces*** : *Number of car parking spaces required by the customer*

***30. total_of_special_requests*** : *Number of special requests made by the customer (e.g. twin bed or high floor)*

***31. reservation_status*** : *Reservation last status, assuming one of three categories*
* Cancelled - booking was canceled by the customer
* Check-Out - customer has checked in but already departed
* No-Show - customer did not check-in and did inform the hotel about the reason

***32. reservation_status_date*** : *Date at which the last status was set. This variable can be used in conjunction with reservation_status to understand when the booking was canceled or when the customer checked out of the hotel*

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
print(df.apply(lambda col : col.unique()))

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.

# Creating new columns for future use.
df['total_guests'] = df['adults'] + df['children'] + df['babies']
df['total_stays'] = df['stays_in_week_nights'] + df['stays_in_weekend_nights']
df['revenue'] = df['adr'] * df['total_stays'] * df['total_guests']


# Converting the data type of existing columns into the appropriate datatype.
df['children'] = df['children'].astype(int)

# We have separate columns for the day, month and year of the arrival date.
# Hence combining all three columns so that we can get complete arrival date in single column.

df['arrival_date'] = df['arrival_date_year'].astype(str) + '/' + df['arrival_date_month'].astype(str) + '/'  + df['arrival_date_day_of_month'].astype(str)

# Converting the datatype of arrival date column from string to datetime.

df['arrival_date'] = pd.to_datetime(df['arrival_date'], format = '%Y/%B/%d')

import calendar
df['arrival_date_month_num'] = df['arrival_date_month'].apply(lambda x: list(calendar.month_name).index(x))

df['reservation_status_date'] = pd.to_datetime(df['reservation_status_date'])

df.head()

### What all manipulations have you done and insights you found?

As part of data wrangling, I created new features and transformed existing ones to make the dataset more analyzable:
1. The most obvious and basic manipulation I have done is dropping the duplicate records.

2. The second most important manipulation made to clean the data is replacing null values with appropriate values. Only four features contain null values. The explanation for the replacement of null values is given above.

3. Combined stay duration into a single total_stays column to simplify analysis

4. Aggregated adults, children, and babies into total_guests to reflect full party size

5. Converted arrival_date_month to numerical values for correct time-series ordering

6. Transformed reservation_status_date into datetime format for date-based filtering and visualization

These manipulations were made to simplify feature relationships, uncover new patterns, and prepare the data for meaningful visual and statistical analysis.
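As a quick sanity check on the month conversion, here is a minimal sketch of a vectorized alternative to the `calendar` lookup, using `pd.to_datetime` with the `%B` format. The month names below are made up for illustration and are not taken from the dataset; the sketch assumes English month names, as stored in `arrival_date_month`.

```python
import pandas as pd

# Hypothetical month names for illustration only (not real dataset rows)
months = pd.Series(["January", "July", "December"])

# Parse full English month names and extract the month number (1-12)
month_num = pd.to_datetime(months, format="%B").dt.month

print(month_num.tolist())
```

On the notebook's data, the same expression could be applied to `df['arrival_date_month']` in place of the `apply` call.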

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

###  Univariate Analysis

#### Chart 1 - Distribution of Bookings by Hotel Type

In [None]:
# Chart 1 - visualization code

# Features used - hotel

# Why this chart ?
# Analyzing hotel types early in EDA gives a high-level view of the dataset structure
# and helps compare patterns across City Hotel and Resort Hotel —
# both of which can behave differently in terms of bookings, revenue, and cancellations.


# Setting up the figure
plt.figure(figsize=(6, 5))

# Setting plot styles
sns.set(style="whitegrid")

# Plotting count of bookings by hotel type
sns.countplot(data=df, x='hotel', palette='coolwarm')


# Setting titles and labels
plt.title('Booking Distribution by Hotel Type', fontsize=14)
plt.xlabel('Hotel Type (City Hotel / Resort Hotel)', fontsize=12, labelpad= 5)
plt.ylabel('Number of Bookings', fontsize=12, labelpad= 10)
plt.tight_layout()

# Showing the plot
plt.show()

##### 1. Why did you pick the specific chart?


* This chart was chosen to understand the distribution of bookings across the two types of hotels in the dataset:

    * City Hotel

    * Resort Hotel

* Since the hotel column is categorical, and we're interested in comparing frequency, a count plot is the most effective and simplest visualization.

* This chart also serves as a foundation for comparing future insights by hotel type, such as cancellation rates, ADR, or seasonal trends.

##### 2. What is/are the insight(s) found from the chart?


* The number of bookings for City Hotel is significantly higher than for Resort Hotel.

* This suggests that City Hotels may be more popular, possibly due to:

    * Business travel

    * Urban tourism

    * Better connectivity or location convenience

* This observation implies that City Hotels are more in demand in the dataset timeframe.
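To put a number on "significantly higher", the share of each hotel type can be computed directly. The sketch below uses a small hypothetical frame rather than the real dataset, and `booking_share` is an illustrative helper name, not part of the notebook.

```python
import pandas as pd

def booking_share(frame: pd.DataFrame, column: str = "hotel") -> pd.Series:
    """Return each category's share of rows, in percent (rounded to 1 decimal)."""
    return (frame[column].value_counts(normalize=True) * 100).round(1)

# Hypothetical toy data, only to show the shape of the result
toy = pd.DataFrame({"hotel": ["City Hotel"] * 3 + ["Resort Hotel"] * 2})

shares = booking_share(toy)
print(shares)
```

On the notebook's data the call would be `booking_share(df)`.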

##### 3. Will the gained insights help creating a positive business impact?

Yes. Understanding that City Hotels attract more bookings allows businesses to:

* Focus marketing and promotions where demand is already high

* Allocate more resources and staff to city properties

* Strategically plan for expansion or improvements

It also suggests that different business strategies might be needed for City vs. Resort Hotels.


##### 4. Are there any insights that lead to negative growth? Justify with specific reason.

Yes, potentially. If Resort Hotels are underperforming:

  * They may suffer from seasonal dependency, fewer marketing efforts, or limited accessibility.

  * If ignored, this imbalance can lead to negative growth for resort properties, especially during off-peak seasons.

Hotels may need to investigate:

  * Why the resort bookings are low

  * Whether pricing, visibility, or customer experience is affecting demand

#### Chart 2- Distribution of Booking Cancellations

In [None]:
# Chart 2 - Visualization code

# Features used - is_canceled

# Why this chart ?
# 1. Business Relevance – Cancellations directly impact hotel revenue and resource utilization
#    (empty rooms, loss of opportunity to sell again).
# 2. Customer Behavior Insight – By visualizing cancellations,
#     we can understand whether cancellations are a frequent occurrence
#     or just occasional. This helps hotels optimize their booking policies
#     (e.g., stricter deposit requirements, better refund policies).
# 3. Operational Impact – High cancellations may indicate issues like poor pricing strategy,
#     ineffective communication, or customer dissatisfaction.
#     Identifying this helps in reducing negative growth.


# Creating layout of subplots to enclose figure in single cell.
fig, ax = plt.subplots()

# Setting labels and values
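labels = ['Not Cancelled', 'Cancelled']
# Reindexing to [0, 1] keeps the counts aligned with the labels, whichever class is more frequent
values = df['is_canceled'].value_counts().reindex([0, 1]).values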

# Creating Pie Chart.
ax.pie(values, explode=[0,0.1], labels=labels, colors = ['lightskyblue', 'plum'], autopct='%1.2f%%',
       shadow={'ox': -0.02, 'edgecolor': 'black', 'shade': 0.9}, startangle=90)

# Setting title
plt.title('Percentage of Cancellations')

# Displaying the chart
plt.show()

##### 1. Why did you pick the specific chart?

* This is a fundamental univariate analysis to understand what percentage of total bookings are cancelled overall.
* Pie charts are well suited to showing parts of a whole, and they make it easy to see at a glance which categories are larger or smaller.
* Understanding the cancellation rate is critical for hotel operations and revenue planning.


##### 2. What is/are the insight(s) found from the chart?

From the chart, we observe:

- A significant proportion of bookings are being cancelled.

- Non-cancelled bookings (0) are higher, but cancellations (1) still make up a notable share, roughly 38–40% of all bookings.

This insight confirms that cancellation is a non-negligible issue in the dataset and should be analyzed in more depth.
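The percentage read off the pie chart can be cross-checked numerically, since the mean of a 0/1 column is the fraction of ones. A minimal sketch on hypothetical values (the real figure depends on the dataset after duplicates are dropped):

```python
import pandas as pd

def cancellation_rate(frame: pd.DataFrame) -> float:
    """Percentage of rows where is_canceled == 1."""
    return float(frame["is_canceled"].mean() * 100)

# Hypothetical toy data: 2 cancellations out of 5 bookings
toy = pd.DataFrame({"is_canceled": [0, 0, 0, 1, 1]})

rate = cancellation_rate(toy)
print(f"Cancellation rate: {rate:.1f}%")
```

Running `cancellation_rate(df)` in the notebook would give the exact share to quote alongside the chart.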

##### 3. Will the gained insights help creating a positive business impact?


Yes, this insight is highly valuable. Knowing that around 4 out of 10 bookings get cancelled allows hotel managers to:

* Introduce better cancellation policies

* Implement revenue management tactics (e.g., overbooking strategies)

* Focus analysis on what causes cancellations to reduce revenue loss

##### 4. Are there any insights that lead to negative growth? Justify with specific reason.

* Yes - the high cancellation rate represents potential revenue loss, vacant rooms, and operational inefficiency.

* Negative growth occurs when cancellations aren't predicted or managed — leading to lost bookings that can't be replaced in time.

* Especially if rooms remain vacant due to last-minute cancellations, the hotel may face loss of income and poor resource utilization.

#### Chart 3 - Distribution of Lead Time

In [None]:
# Chart 3 - Visualization code

# Features used - lead_time

# Why this chart ?
# Lead time (number of days between booking and arrival) is strongly related to
# cancellation probability, booking trends, and planning behavior.

# Setting plot size
plt.figure(figsize=(10, 5))

# Plotting histogram of lead time with KDE
sns.histplot(df['lead_time'], bins=50, kde=True, color='skyblue')

# Setting title and labels
plt.title('Distribution of Lead Time (in Days)', fontsize=14)
plt.xlabel('Lead Time (Days between booking and arrival)', fontsize=12)
plt.ylabel('Number of Bookings', fontsize=12)
plt.grid(axis='y', linestyle='--', alpha=0.5)
plt.tight_layout()

# Showing the plot
plt.show()


##### 1. Why did you pick the specific chart?

I chose a histogram with a KDE (Kernel Density Estimation) because lead_time is a continuous numerical variable, and we want to understand:

  * The distribution of lead times across all bookings

  * Whether the data is skewed or has outliers

  * The density of bookings made closer vs. further in advance

This chart type is best for uncovering patterns in booking behavior over time.

##### 2. What is/are the insight(s) found from the chart?

* The majority of bookings have a short lead time, typically under 50 days.

* There's a long tail of bookings made 100-300+ days in advance, though these are less frequent.

* The distribution is right-skewed, meaning most people book closer to their arrival date.

This suggests that most guests prefer flexibility or decide late to travel — especially relevant for business travelers in City Hotels.
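The skew and the "under 50 days" observation can be backed with summary statistics. The following is a sketch on made-up lead times; the 50-day threshold is taken from the text above, everything else is illustrative.

```python
import pandas as pd

# Hypothetical lead times in days (not real dataset values)
lead = pd.Series([0, 3, 7, 10, 14, 20, 30, 45, 120, 300], name="lead_time")

short_share = (lead < 50).mean()   # fraction of bookings made under 50 days ahead
skewness = lead.skew()             # a positive value indicates a right-skewed distribution
mean_days, median_days = lead.mean(), lead.median()

print(f"Share under 50 days: {short_share:.0%}")
print(f"Skewness: {skewness:.2f}, mean: {mean_days:.1f}, median: {median_days:.1f}")
```

A mean well above the median, together with positive skewness, is consistent with the long right tail seen in the histogram.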

##### 3. Will the gained insights help creating a positive business impact?

Yes. Understanding lead time helps hotels:

- Forecast demand more accurately and prepare staff/inventory accordingly

- Implement dynamic pricing models (e.g., offer discounts for early bookings or surge pricing for last-minute ones)

- Target marketing for different lead-time segments (e.g., "early birds" vs. "last-minute planners")

This directly improves both revenue management and operational planning.





##### 4. Are there any insights that lead to negative growth? Justify with specific reason.

Yes - The long tail of bookings with very high lead times can lead to higher cancellation rates, which means:

* Lost revenue if cancellations happen too close to arrival

* Vacant rooms that can’t be resold last-minute

* Waste of operational planning (staffing, supplies)

These long lead-time bookings, if not secured with deposits or flexible policies, can contribute to negative growth due to unreliable revenue projection.
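The claim that long lead times go with more cancellations is something a univariate histogram cannot show on its own. One way to check it, sketched here on hypothetical rows, is to bucket `lead_time` with `pd.cut` and compare the cancellation rate per bucket. The bin edges and labels are assumptions chosen for illustration, and the toy values are constructed to show the mechanics, not the real pattern.

```python
import pandas as pd

# Hypothetical rows, built only to demonstrate the calculation
toy = pd.DataFrame({
    "lead_time":   [2, 10, 40, 60, 90, 150, 200, 320],
    "is_canceled": [0, 0, 0, 1, 0, 1, 1, 1],
})

# Assumed bucket edges: 0-30, 31-100 and more than 100 days
toy["bucket"] = pd.cut(
    toy["lead_time"],
    bins=[0, 30, 100, float("inf")],
    labels=["0-30", "31-100", "100+"],
    include_lowest=True,
)

# Mean of the 0/1 flag per bucket = cancellation rate per bucket
rate_by_bucket = toy.groupby("bucket", observed=True)["is_canceled"].mean()
print(rate_by_bucket)
```

If the rate rises across buckets on the real data, that supports securing long lead-time bookings with deposits.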

#### Chart 4 - Distribution of ADR (Average Daily Rate)

In [None]:
# Chart 4 - Visualization code

# Features used - adr

# Why this chart ?
# The adr (Average Daily Rate) column reflects
# the price charged per room per night —
# it's a core revenue driver for hotels.


# Filtering out extreme outliers for a cleaner plot
filtered_df = df[df['adr'] < 500]  # remove extreme ADR outliers for visualization

# Setting the figure size
plt.figure(figsize=(10, 5))

# Plotting distribution of ADR
sns.histplot(filtered_df['adr'], bins=60, kde=True, color='mediumseagreen')

# Adding titles and labels
plt.title('Distribution of ADR (Average Daily Rate)', fontsize=14)
plt.xlabel('ADR (Room Price per Night in EUR)', fontsize=12)
plt.ylabel('Number of Bookings', fontsize=12)
plt.grid(axis='y', linestyle='--', alpha=0.5)
plt.tight_layout()

# Showing the plot
plt.show()


##### 1. Why did you pick the specific chart?

I chose a histogram with KDE (Kernel Density Estimation) because adr is a continuous numerical variable and we want to understand:

* The distribution of room prices

* Whether most guests are paying low, medium, or high rates

* The presence of outliers or pricing anomalies

It's important for both revenue analysis and pricing strategy.



##### 2. What is/are the insight(s) found from the chart?

* The majority of bookings fall in the €50–€150 range per night.

* The distribution is right-skewed with a few bookings having ADR > €300 — likely outliers or special cases (e.g., group bookings or luxury rooms).

* There's a smooth density curve indicating relatively consistent pricing strategy with some pricing variance.

This shows that most guests choose moderately priced rooms, and luxury/high-end bookings are much less common.
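The "majority fall in the €50–€150 range" statement can be quantified with `Series.between` and quantiles. A minimal sketch, using hypothetical ADR values rather than the dataset:

```python
import pandas as pd

# Hypothetical ADR values in EUR (illustration only)
adr = pd.Series([40, 60, 80, 95, 110, 140, 160, 320], name="adr")

# Fraction of bookings priced between 50 and 150 (both bounds inclusive)
in_range = adr.between(50, 150).mean()

# Quartiles summarise where the bulk of prices sit
quartiles = adr.quantile([0.25, 0.5, 0.75])

print(f"Share in 50-150 range: {in_range:.1%}")
print(quartiles)
```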

##### 3. Will the gained insights help creating a positive business impact?


Yes. Understanding ADR helps in:

* Optimizing pricing strategies for different seasons or customer types

* Designing promotional offers around the most common price brackets

* Segmenting customers into price-sensitive vs. premium-paying groups

* Identifying unusual pricing cases or errors in data entry

ADR is directly tied to revenue, so any insight here has high business value.


##### 4. Are there any insights that lead to negative growth? Justify with specific reason.

Yes - possible negative indicators:

* Very low ADR values might indicate discounted or loss-making sales, potentially hurting profitability.

* Outlier high ADRs (e.g., > €500) might result in customer dissatisfaction or indicate data entry issues if not valid.

* If pricing is too static, hotels may miss out on dynamic pricing opportunities to maximize profit.

Hence, outlier monitoring and ADR optimization are crucial for long-term growth.
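For outlier monitoring, the chart code above uses a fixed cutoff (`adr < 500`). A common alternative, shown here only as a sketch and not as the notebook's method, is the 1.5 × IQR rule. The helper name `iqr_outlier_mask` and the sample values are hypothetical.

```python
import pandas as pd

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values lying more than k * IQR outside the first/third quartile."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Hypothetical ADR values with one extreme entry
adr = pd.Series([50, 60, 70, 80, 90, 100, 110, 5400], name="adr")

mask = iqr_outlier_mask(adr)
print(adr[mask])
```

Unlike a fixed threshold, this rule adapts to the spread of the data, though flagged rows still need a manual look before being treated as errors.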


#### Chart 5 - Distribution of Bookings Per Month

In [None]:
# Chart 5 - Visualization code

# Features used - arrival_date_month

# Why this chart ?
# 1. Business Relevance – Hotels operate in a highly seasonal industry.
#     Understanding monthly booking trends helps in forecasting demand, planning staff schedules,
#     managing inventory, and optimizing pricing strategies.
# 2. Customer Behavior Insight – It reveals seasonality and peak/off-peak months.
#     For example, high bookings in holiday months vs. low bookings in off-seasons.
# 3. Positive Business Impact – Insights from monthly booking trends allow hotels to:
#       - Run targeted marketing campaigns during low-demand months.
#       - Adjust pricing strategies to maximize profit during peak months.
#       - Optimize resource allocation (staff, rooms, amenities).
# 4. Negative Growth Signals – If certain months consistently show low bookings,
#      it may indicate poor promotions, lack of demand, or competitive disadvantage
#      during that season. This insight can prevent revenue leakage.


# Setting the theme for better visuals
sns.set_theme(style="whitegrid")

# Step 1: Defining the calendar order of months so the bars sort properly
month_order = ['January', 'February', 'March', 'April', 'May', 'June',
               'July', 'August', 'September', 'October', 'November', 'December']

# Step 2: Grouping by month to get booking counts
monthly_bookings = df['arrival_date_month'].value_counts().reindex(month_order)

# Step 3: Plotting the bar chart
plt.figure(figsize=(12,6))
sns.barplot(x=monthly_bookings.index, y=monthly_bookings.values, palette='YlGnBu')

# Adding labels and title
plt.title('Number of Bookings Per Month', fontsize=16, fontweight='bold')
plt.xlabel('Month')
plt.ylabel('Number of Bookings')

# Annotating values on bars
for i, value in enumerate(monthly_bookings.values):
    plt.text(i, value + 200, str(value), ha='center', va='bottom', fontweight='bold')

plt.xticks(rotation=45)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

I chose a **bar** chart to represent the number of bookings per month because :

* Bar charts are ideal for comparing categorical data like months.

* Since hotel booking is a time-sensitive business, visualizing seasonal trends across months is crucial to identifying **peak** and **off-peak periods**.

* This chart helps easily spot fluctuations in customer volume over the year.

##### 2. What is/are the insight(s) found from the chart?

* The chart typically shows that July and August have the highest number of bookings, suggesting peak tourist season in those months.

* April and May also see moderate booking levels, likely due to holiday travel.

* Months like January and November often have lower bookings, indicating off-peak periods.

* These patterns help us understand customer seasonality, travel preferences, and help prepare for resource allocation during peak times.
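To name the peak month programmatically rather than reading it off the bars, the monthly counts can be reindexed in calendar order and queried with `idxmax`. This sketch uses hypothetical arrivals; `fill_value=0` is an addition that keeps months with no bookings as 0 instead of NaN, and the month names assume the default English locale.

```python
import calendar
import pandas as pd

# Calendar order derived from the standard library (index 0 is an empty string)
month_order = list(calendar.month_name)[1:]

# Hypothetical arrival months (illustration only)
toy = pd.Series(["July", "January", "August", "July", "March"],
                name="arrival_date_month")

# Count per month in calendar order; absent months become 0
counts = toy.value_counts().reindex(month_order, fill_value=0)

peak_month = counts.idxmax()
print(counts)
print("Peak month:", peak_month)
```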

##### 3. Will the gained insights help creating a positive business impact?


Yes, absolutely. This insight is directly tied to revenue planning and inventory management:

* Hotels can increase prices during peak months to maximize profit (yield management).

* Staffing and inventory (like room service, amenities, parking, etc.) can be increased in advance for busy periods.

* During off-peak months, hotels can run discounted campaigns to increase occupancy or partner with travel agencies.


##### 4. Are there any insights that lead to negative growth? Justify with specific reason.

Yes. If bookings are consistently low in certain months (e.g., January or November), this could indicate:

* Lack of promotions during those months.

* Unfavorable weather or local conditions.

* Poor visibility in travel portals during those times.

Not acting on these low-performing months could lead to negative growth. But with this insight, hotels can strategically boost bookings through offers, partnerships, or local events.


#### Chart 6 - Distribution of Bookings across Market Segments

In [None]:
# Chart 6 - Visualisation Code.

# Features used - market_segment

# Why this chart ?
# The market_segment column tells where the booking came from —
# e.g., online travel agencies, direct bookings, corporate bookings, etc.
# This is extremely valuable for marketing, sales strategy,
# and customer acquisition cost analysis.


# Storing value of total number of rows for further use
rows = len(df)

# Creating subplot
fig, ax = plt.subplots(figsize = (10,5))

# Filtering dataset and clipping necessary data.
x = df['market_segment'].value_counts().index
y = [round((i*100 / rows),1) for i in df['market_segment'].value_counts().values]

# Setting random sizes for markers
s = (np.random.randint(150,200,len(x)))

# Customising grid and axes.
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.tick_params(bottom=False, left=False)
ax.set_axisbelow(True)
ax.yaxis.grid(True)
ax.xaxis.grid(False)

# Characterising Plot.
plt.scatter(x, y ,marker = 'o', s=s, alpha=0.5, c=np.random.rand(len(x)))

# Code to display title and labels for x,y axes.
plt.title('Distribution of Bookings Over Market Segment', fontdict={'fontsize' : 15})
plt.xlabel('Market Segment', fontdict={'fontsize' : 15} , labelpad = 10 )
plt.ylabel('Percentage of Total Bookings', fontdict={'fontsize' : 15})

# Code for showing values of marker.
for i in range(len(x)):
    plt.annotate(f"{y[i]}%", (i-0.2, y[i]+3), fontsize = 11)

# Tweak spacing to prevent clipping of ylabel
plt.subplots_adjust(left=0.15)

# Setting limits for y axis.
ax.set_ylim(-10,70)

ax.tick_params(axis = 'x', length=4, labelrotation = 20, pad = 2 ,
                       labelcolor='black', grid_color='b')

ax.xaxis.set_major_locator(mpl.ticker.MultipleLocator(base=1))

# Plotting chart.
plt.show()


##### 1. Why did you pick the specific chart?

I chose this chart because understanding booking sources is essential for:

* Evaluating where customers are coming from

* Identifying the most effective booking channels

* Recognizing opportunities to increase direct bookings (which often have lower acquisition cost)

A scatter plot is appropriate since market_segment is a categorical feature and we're comparing frequency of each category.



##### 2. What is/are the insight(s) found from the chart?

* Online Travel Agents (OTA) dominate the bookings by far, followed by offline travel agents and direct bookings.

* Segments like corporate, groups, and complementary are relatively smaller.

* Direct bookings are much fewer, which might mean higher dependency on third-party platforms.

This reveals which marketing and distribution channels the business relies on most.
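How strongly the top segments dominate can be expressed as a cumulative share. The sketch below builds such a table on hypothetical segment labels; the column names `share_pct` and `cumulative_pct` are illustrative.

```python
import pandas as pd

# Hypothetical segment labels (illustration only)
toy = pd.Series(
    ["Online TA"] * 5 + ["Offline TA/TO"] * 3 + ["Direct"] * 2,
    name="market_segment",
)

# value_counts sorts by frequency, so cumsum gives a running share from the largest segment down
share = toy.value_counts(normalize=True) * 100
summary = pd.DataFrame({
    "share_pct": share.round(1),
    "cumulative_pct": share.cumsum().round(1),
})
print(summary)
```

Reading down `cumulative_pct` shows how many segments are needed to cover, say, 80% of bookings.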

##### 3. Will the gained insights help creating a positive business impact?


Absolutely. This insight can guide:

* Marketing spend optimization (e.g., spend more on high-converting channels)

* Reducing OTA dependency by increasing direct booking offers

* Building corporate partnerships or expanding underperforming segments

* Tailoring customer service to segment expectations

Booking source is a key strategic driver in hospitality.


##### 4. Are there any insights that lead to negative growth? Justify with specific reason.

Yes - Over-reliance on third-party platforms like OTAs can:

- Eat into profit margins due to commission fees

- Reduce brand loyalty and customer retention

- Cause pricing pressure due to platform competition

If direct and corporate bookings remain low, hotels miss out on higher-margin business, which can impact long-term growth.

#### Chart 7 - Meal Preferences Distribution

In [None]:
# Chart 7 - Visualization code

# Features used - meal

# Why this chart ?
# Hotels not only earn from room bookings but also from F&B (Food & Beverages) services.
# Knowing which meal types (BB – Bed & Breakfast, HB – Half Board, FB – Full Board,
# SC – Self Catering, etc.) are preferred gives valuable insight into guest consumption behavior.
# Hotels can optimize kitchen operations and staffing according to demand.
# Allows better menu planning & cost control.
# Helps in designing custom offers/packages

# Getting value counts for meal types
meal_counts = df['meal'].value_counts()
labels = meal_counts.index
sizes = meal_counts.values
total = sizes.sum()

# Preparing legend labels: Meal Type (Count - %)
legend_labels = [
    f"{label} ({count} - {count/total:.1%})"
    for label, count in zip(labels, sizes)
]

# Setting up figure
plt.figure(figsize=(8, 8))

# Drawing pie chart (percent only inside pie)
wedges, _, _ = plt.pie(
    sizes,
    autopct='%1.1f%%',
    startangle=140,
    colors=plt.cm.Pastel1.colors,
    textprops=dict(color="black", fontsize=11)
)

# Adding legend with full labels
plt.legend(
    wedges, legend_labels,
    title='Meal Types',
    loc='center left',
    bbox_to_anchor=(1, 0.5),
    fontsize=11
)

# Setting title
plt.title('Meal Plan Distribution Among Guests', fontsize=14)

# Final layout
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

I chose a pie chart for this categorical variable because:

* It clearly communicates proportions of meal plans

* It's visually intuitive and effective for variables with a few distinct categories



##### 2. What is/are the insight(s) found from the chart?

* The most popular meal plan is BB (Bed & Breakfast), making up the majority of bookings.

* Other meal types like HB (Half Board) and FB (Full Board) are much less common.

* A small portion of bookings had no meal plan.

This indicates that guests prefer simpler meal options, or maybe that breakfast is bundled more often than full meals.

##### 3. Will the gained insights help creating a positive business impact?


Yes. The hotel can:

* Focus on optimizing breakfast services, since it's in high demand

* Offer upsell campaigns to convert BB customers to HB/FB plans

* Rethink pricing for meal plans if certain types are underperforming

It also helps with kitchen resource planning and menu design.


##### 4. Are there any insights that lead to negative growth? Justify with specific reason.

Yes -
* Low adoption of HB/FB may suggest that the meal pricing or menu variety isn't attractive enough

* If meal plans aren't contributing to profit, the hotel could be missing out on cross-selling opportunities

* Having too many underused meal options increases operational complexity and food waste

Understanding this distribution can help hotels simplify offerings and increase F&B profitability.

#### Chart 8 - Top 10 Countries by Guest Count

In [None]:
# Chart 8 - Visualization code

# Features used - country

# Why this chart ?
# 1. Important for geographic segmentation
# 2. Helps identify profitable markets and potential for region-specific promotions
# 3. Shows how well the hotel brand is performing internationally vs. domestically
# 4. Essential for making decisions around multi-lingual support,
#    currency acceptance, and marketing focus

# Creating a DataFrame for top 10 countries
country_df = df['country'].value_counts().sort_values(ascending=False).reset_index().head(10)
country_df.columns = ['country', 'count']

# Calculating percentage of bookings per country
X = country_df['country']
Y = [round(((i * 100) / rows), 2) for i in country_df['count']]

# Plotting setup
fig, ax = plt.subplots(figsize=(10, 6))
# ax.spines['top'].set_visible(False)
# ax.spines['right'].set_visible(False)
ax.tick_params(bottom=False, left=False)
ax.set_axisbelow(True)
ax.yaxis.grid(True, color='#C5C9C7')
ax.xaxis.grid(False)

# Plotting bars
bar = ax.bar(X, Y)
plt.title('Top 10 Countries by Guest Count', pad=20, fontsize=15)
ax.set_xlabel("Country", fontsize=15, labelpad=15)
ax.set_ylabel("Percentage of Total Bookings", fontsize=15, labelpad=15)

# Gradient color function
def gradientbars(bars, cmap):
    ax = bars[0].axes
    lim = ax.get_xlim() + (0, 35)
    i = 0
    for bar in bars:
        bar.set_zorder(1)
        bar.set_facecolor('none')
        x, y = bar.get_xy()
        w, h = bar.get_width(), bar.get_height()
        grad = np.atleast_2d(np.linspace(0, 1, 256)).T
        ax.imshow(grad, extent=[x+w, x, y, y+h], aspect='auto', zorder=1, cmap=plt.get_cmap(cmap))
        plt.annotate(f"{Y[i]}%", (x+0.1, y+h+0.5), fontsize=12)
        i += 1
    ax.axis(lim)

gradientbars(bar, 'winter')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

* I chose this chart to understand which countries contribute the most guests to the hotels.

* A bar chart with country names is ideal for comparing geographic proportions.

* Since business decisions vary significantly across regions (pricing, promotions, language support), this insight is crucial for hotel strategy



##### 2. What is/are the insight(s) found from the chart?

* The majority of guests come from a few specific countries, such as Portugal, the UK, and France (based on typical trends).

* The top 10 countries often account for a large portion of total bookings, indicating a concentration of market activity.

* This helps identify key markets where the hotel is already strong.

##### 3. Will the gained insights help creating a positive business impact?


Yes. These insights allow hotel businesses to:

* Target advertisements and promotions in high-performing countries

* Focus customer support efforts (e.g., language-specific services)

* Expand presence in regions where the brand is already gaining traction

* Design geo-specific offers (seasonal discounts, loyalty points, etc.)


##### 4. Are there any insights that lead to negative growth? Justify with specific reason.

Yes. If certain countries never appear in the top list, it may indicate:

* Poor market presence in those regions

* Lack of brand awareness

* Failure to support international payment, language, or travel needs

Without addressing this gap, hotels may miss out on diversification and growth potential, especially from emerging travel markets.

#### Chart 9 - Distribution of Total Guests

In [None]:
# Chart 9 - Visualization code

# Features used :- total_guests

# Why this chart ?
# Helps in capacity planning and upselling opportunities (family rooms, larger suites, adjoining rooms).
# Identifies target customer segments (e.g., if mostly business travelers book single occupancy).

# Filtering to remove bookings with zero guests (just in case)
filtered_df = df[df['total_guests'] > 0]

# Setting figure size
plt.figure(figsize=(10, 5))

# Creating strip plot
sns.stripplot(data=filtered_df, x='total_guests', color='teal', size=4, jitter=True)

# Adding titles and labels
plt.title('Distribution of Total Guests per Booking', fontsize=14)
plt.xlabel('Total Number of Guests (Adults + Children + Babies)', fontsize=12)
plt.ylabel('Density of Bookings', fontsize=12)
plt.grid(axis='y', linestyle='--', alpha=0.5)
plt.tight_layout()

# Showing the plot
plt.show()


##### 1. Why did you pick the specific chart?

* A strip plot shows the spread and density of a discrete numerical variable, which makes it great for total_guests.

* Unlike histograms, it shows each data point and gives you a feel for concentration and outliers.

##### 2. What is/are the insight(s) found from the chart?

* Most bookings are for 1-2 guests, which is expected (solo travelers, couples).

* There are noticeable frequencies at 3 and 4 guests, indicating small families.

* Higher guest counts (5-10+) are very rare but present — possibly group bookings or large families.

##### 3. Will the gained insights help creating a positive business impact?


Yes! This data helps hotels:

* Understand room capacity demand (single, double, family rooms)

* Plan amenities (extra beds, cribs, etc.)

* Optimize group vs. individual marketing strategies

If 2-guest stays dominate, hotels can focus offers/packages toward couples or business travelers


##### 4. Are there any insights that lead to negative growth? Justify with specific reason.

Yes:

- Very few large bookings may suggest the hotel is not appealing to group travelers.

- If families feel constrained by room sizes or policies, they may choose other properties.

- This could indicate missed revenue from larger-room upsells or group events.

The hotel might consider targeting small groups or family bundles to grow this segment.

#### Chart 10 - Deposit Type Distribution

In [None]:
# Chart 10 - Visualisation Code

# Features used - deposit_type

# Why this chart ?
# This column indicates whether a customer made a
# No Deposit, Non-Refundable, or Refundable deposit when booking.
# It has direct implications for cancellations, revenue assurance, and customer trust.

# Setting figure size
plt.figure(figsize=(7, 5))

# Creating count plot
sns.countplot(data=df, x='deposit_type', palette='Set2', order=df['deposit_type'].value_counts().index)

# Adding titles and labels
plt.title('Distribution of Deposit Types', fontsize=14)
plt.xlabel('Deposit Type', fontsize=12)
plt.ylabel('Number of Bookings', fontsize=12)
plt.tight_layout()

# Showing the chart
plt.show()


##### 1. Why did you pick the specific chart?

- Deposit policy is a key business lever: it affects cash flow, cancellation rates, and booking confidence.

- A count plot clearly shows the preference or dominance of certain deposit types.

This chart sets the stage for deeper future bivariate insights like deposit type vs cancellation rate.
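
As a rough sketch of that follow-up analysis, the cancellation rate per deposit type can be computed with a single `groupby`. The column name `is_canceled` (1 = cancelled, 0 = kept) is an assumption about the dataset, and the toy rows below are made up purely for illustration:

```python
import pandas as pd

# Toy stand-in for the bookings DataFrame; values are invented for illustration.
# `is_canceled` is an assumed column name (1 = cancelled, 0 = not cancelled).
toy = pd.DataFrame({
    "deposit_type": ["No Deposit", "No Deposit", "No Deposit", "No Deposit",
                     "Non Refund", "Non Refund", "Refundable"],
    "is_canceled":  [0, 1, 0, 0, 1, 1, 0],
})

# The mean of a 0/1 flag within each group is that group's cancellation rate.
cancel_rate = (
    toy.groupby("deposit_type")["is_canceled"]
       .agg(bookings="count", cancel_rate="mean")
       .sort_values("bookings", ascending=False)
)
print(cancel_rate)
```

On the real data, replacing `toy` with `df` should give the same table shape, provided the assumed column exists.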


##### 2. What is/are the insight(s) found from the chart?

- The vast majority of bookings are made with No Deposit.

- Non-Refundable deposits are a small minority.

- Refundable deposits are extremely rare.

This suggests that the hotel allows flexible booking for most customers — likely to attract volume.

##### 3. Will the gained insights help creating a positive business impact?

Absolutely. The business can use this insight to:

- Evaluate the risk associated with cancellations

- Design prepayment policies for specific customer segments

- Test non-refundable incentives (e.g., offer small discounts for advance payment)

It also helps in financial planning and forecasting.

##### 4. Are there any insights that lead to negative growth? Justify with specific reason.

Yes:

- Heavy reliance on No Deposit bookings makes the business vulnerable to last-minute cancellations.

- This can lead to lost revenue, unbooked inventory, and operational inefficiencies.

- A shift towards smartly positioned non-refundable policies could help secure cash flow.

- The key is balance — don’t scare off customers, but don’t stay overly flexible either.

#### Chart 11 - Distribution of Booking Changes

In [None]:
# Chart 11 -  visualization code

# Features used - booking_changes

# Why this chart ?
# The booking_changes column tells us how many times a guest modified their reservation —
# this affects staffing, room assignments, and even cancellation probabilities.


# Preparing data: counting booking changes and filtering to top values
change_counts = df['booking_changes'].value_counts().sort_index()
filtered_changes = change_counts[change_counts.index < 10]  # limit to 0–9 changes

# Setting plot size
plt.figure(figsize=(8, 6))

# Creating horizontal barplot
bars = plt.barh(
    y=filtered_changes.index.astype(str),
    width=filtered_changes.values,
    color='skyblue',
    edgecolor='black'
)

# Annotating bars with values
for bar in bars:
    plt.text(bar.get_width() + 50, bar.get_y() + bar.get_height()/2,
             f'{int(bar.get_width())}', va='center', fontsize=10)

# Titles and labels
plt.title('Number of Bookings by Booking Changes Count', fontsize=14)
plt.xlabel('Number of Bookings', fontsize=12)
plt.ylabel('Number of Changes Made', fontsize=12)
plt.grid(axis='x', linestyle='--', alpha=0.4)
plt.tight_layout()

# Show the chart
plt.show()


##### 1. Why did you pick the specific chart?

* This feature is numerical but discrete, so a bar plot shows the frequency distribution more clearly than a histogram.

* booking_changes reveals guest behavior, operational friction, and can even signal booking uncertainty.

##### 2. What is/are the insight(s) found from the chart?

* Most guests make zero changes to their booking.

* A smaller percentage make 1 or 2 changes.

* Very few bookings involve 3+ changes, indicating edge cases or indecisive customers.

This tells us the majority of bookings are stable once confirmed.


##### 3. Will the gained insights help creating a positive business impact?


Yes:

* Hotels can identify which customer types or channels lead to more modifications

* Minimize room shuffling and reassignments due to high-change bookings

* Offer better UX in booking systems to reduce the need for changes

Understanding this behavior improves staff planning and guest experience.
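
One possible way to find which channels lead to more modifications is to summarize `booking_changes` per `market_segment`. This is only a sketch on made-up rows; the segment labels and numbers are illustrative, not results from the dataset:

```python
import pandas as pd

# Invented rows standing in for the bookings DataFrame.
toy = pd.DataFrame({
    "market_segment":  ["Online TA", "Online TA", "Direct", "Direct", "Groups", "Groups"],
    "booking_changes": [0, 1, 0, 0, 2, 4],
})

# Average number of changes and share of bookings changed at least once, per segment.
summary = (
    toy.groupby("market_segment")["booking_changes"]
       .agg(mean_changes="mean", share_changed=lambda s: (s > 0).mean())
       .sort_values("mean_changes", ascending=False)
)
print(summary)
```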



##### 4. Are there any insights that lead to negative growth? Justify with specific reason.

Yes:

* Too many booking changes increase administrative overhead and room allocation complexity

* For high-change guests, hotels might incur extra costs or overbookings

* Could also be linked to higher cancellation risk

Hotels might introduce limits or small fees for excessive changes to balance flexibility and cost.

#### Chart 12 - Top 10 Agents Contributing to Business

In [None]:
# Chart 12 -  Visualization code

# Features used - agent

# Why this chart ?
# - Booking agents are crucial intermediaries in the hotel industry —
#   they directly influence the volume and type of bookings.
# - Some agents bring high-volume but high-cancellation customers (e.g., OTAs),
#   while others (like corporate agents) may bring fewer but more reliable bookings.
# - Identifying the top 10 agents helps the hotel:
#       1. Build stronger partnerships
#       2. Negotiate better commission rates
#       3. Decide where to invest in B2B marketing


# Creating layout
fig, ax = plt.subplots(figsize=(10, 6))

# Step 1: Preparing the data
agent_df = df['agent'].value_counts().reset_index()
agent_df.columns = ['Agent', 'Number of Bookings']
agent_df = agent_df.sort_values(by='Number of Bookings', ascending=False).head(10)

# Step 2: X and Y for plotting
x = agent_df['Agent'].astype(str)
y = agent_df['Number of Bookings'].apply(lambda i: round(i * 100 / len(df), 2))

# Step 3: Creating stem plot
plt.stem(x, y, linefmt='darkorchid', markerfmt='D', basefmt=" ")

# Annotating each point
for i in range(len(x)):
    plt.annotate(f"{y[i]}%", (i - 0.3, y[i] + 1), fontsize=11)

# Chart styling
ax.tick_params(bottom=False, left=False)
ax.set_axisbelow(True)
ax.yaxis.grid(True, color='#C5C9C7')
ax.xaxis.grid(False)
ax.set_ylim(0, max(y) + 10)

# Labels and title
ax.set_xlabel('Agent ID')
ax.set_ylabel('Percentage of Total Bookings')
ax.set_title('Top 10 Agents Who Made Maximum Bookings', pad=10)

plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

I chose this chart to identify the top-performing agents who bring the highest number of bookings.

* A stem plot offers a clean and vertical visual representation, helping to highlight individual contributions clearly without clutter.

* Since agents play a key role in the booking funnel, this analysis is crucial.

##### 2. What is/are the insight(s) found from the chart?

* The top 10 agents contribute a significant percentage of all bookings, suggesting that a small number of agents are responsible for the majority of sales.

* Some agents are high-volume contributors, and possibly tied to corporate or OTA platforms.

* There's a long-tail distribution, where most agents contribute only a few bookings.
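
The concentration described above can be quantified with a cumulative share. The sketch below uses made-up agent IDs in place of `df['agent']`, and `top_n` is a hypothetical cutoff chosen only to suit the toy data:

```python
import pandas as pd

# Made-up agent IDs standing in for df['agent']; real IDs and counts will differ.
agents = pd.Series([9, 9, 9, 9, 9, 240, 240, 240, 1, 1, 14, 7, 6, 3])

# Percentage of bookings per agent (largest first), then the running total.
share = agents.value_counts(normalize=True) * 100
cumulative = share.cumsum()

top_n = 3  # hypothetical cutoff; the chart above uses the top 10
top_share = cumulative.iloc[top_n - 1]
print(f"Top {top_n} agents account for {top_share:.1f}% of bookings")
```

A steep rise in `cumulative` over the first few agents followed by a slow climb is what a long-tail distribution looks like in numbers.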


##### 3. Will the gained insights help creating a positive business impact?


Yes:

* Hotels can build stronger partnerships with the agents that bring in the most bookings

* Negotiate better commission rates using actual booking volumes

* Decide where to invest in B2B marketing

Knowing which agents drive volume helps the hotel prioritize its partner relationships.



##### 4. Are there any insights that lead to negative growth? Justify with specific reason.

Yes:

* If a small number of agents are responsible for most bookings, losing one of them could cause a sharp drop in volume

* High-volume agents (e.g., OTAs) may also bring high-cancellation customers, so volume alone can overstate their value

* Heavy dependence on a few agents weakens the hotel's position when negotiating commission rates

Hotels might diversify their agent base and track cancellation rates per agent alongside booking volume.

#### Chart 13 - Distribution of Bookings across Distribution Channels

In [None]:
# Chart 13 - Visualization code

# Features used :- distribution_channel

# Why this chart ?
# Hotels rely heavily on distribution mix.
# Too much reliance on OTAs (Online Travel Agencies like Booking.com, Expedia)
# may reduce profit margins due to commissions.
# A healthy share of direct bookings increases profitability.
# Corporate channel strength indicates long-term business contracts.

# Counting values for distribution_channel
channel_counts = df['distribution_channel'].value_counts()

# Plotting setup
plt.figure(figsize=(8, 5))

# Barplot sorted by frequency
sns.barplot(x=channel_counts.index, y=channel_counts.values, palette='Set3')

# Annotating bars with values
for i, value in enumerate(channel_counts.values):
    plt.text(i, value + 500, str(value), ha='center', fontsize=10)

# Labels and title
plt.title('Booking Distribution by Distribution Channel', fontsize=14)
plt.xlabel('Distribution Channel', fontsize=12)
plt.ylabel('Number of Bookings', fontsize=12)
plt.tight_layout()

# Showing plot
plt.show()

##### 1. Why did you pick the specific chart?

* This feature represents how customers are acquired, which is central to sales strategy and partnership decisions.

* A bar plot works best here as we’re comparing absolute frequencies across categorical values.

* This version includes value labels for executive-style reporting.

##### 2. What is/are the insight(s) found from the chart?

* The majority of bookings come through the TA/TO (Travel Agent/Tour Operator) channel.

* The Direct channel is significant but not dominant.

* Corporate and GDS (Global Distribution Systems) are small contributors.

This tells us the hotel heavily relies on intermediaries to get business.

##### 3. Will the gained insights help creating a positive business impact?


Absolutely:

* Encourages investment in strengthening the direct channel (more profit, better control).

* Allows targeted campaigns for underused channels like corporate partnerships.

* Helps reduce dependency on travel agents, which often come with high commission.

Optimizing channel strategy directly boosts profit margins and guest loyalty.

##### 4. Are there any insights that lead to negative growth? Justify with specific reason.

Yes:

* Over-reliance on TA/TO can hurt profitability due to high acquisition cost.

* Too little direct booking traffic limits opportunities for brand engagement and upselling.

* It makes the hotel vulnerable to OTA platform policies and price wars.

Improving direct booking flow (via web/app/loyalty programs) is a strategic must.

#### Chart 14 - Room Type Preferred by Customers

In [None]:
# Chart 14 - Visualization code

# Features used - room_type

# Why this chart ?
# Understanding which room types are most preferred (reserved) by customers provides:
#   - Insights into customer preferences
#   - Opportunity for price optimization
#   - Helps forecast demand for inventory management
#   - Identifies any misalignment between what guests reserve vs. what they’re assigned


# Setting the style
sns.set_style("whitegrid")

# Counting plot of reserved room types
plt.figure(figsize=(10,6))
ax = sns.countplot(data=df, x='reserved_room_type', order=df['reserved_room_type'].value_counts().index, palette="crest")

# Adding counts on top of bars
for container in ax.containers:
    ax.bar_label(container, fmt='%d', label_type='edge', padding=2, fontsize=11)

# Title and labels
plt.title('Room Type Preferred by Customers', fontsize=16, fontweight='bold', pad=15)
plt.xlabel('Reserved Room Type', fontsize=13)
plt.ylabel('Number of Bookings', fontsize=13)
plt.xticks(fontsize=11)
plt.yticks(fontsize=11)
plt.tight_layout()

# Showing the plot
plt.show()


##### 1. Why did you pick the specific chart?

* I chose a bar chart because it is perfect for analyzing frequency of categorical data like room types.

* It helps us clearly see which room categories are most popular among customers.

* Understanding booking patterns by room type is vital for inventory and pricing strategy in the hospitality sector.

##### 2. What is/are the insight(s) found from the chart?

* 'A' and 'D' room types are clearly preferred by most customers.

* Some room types like 'L', 'P' and 'H' have very low demand, which may indicate either poor marketing, unsatisfactory features, or overpricing.

* A mismatch between reserved and assigned room types may need further analysis to identify operational issues.
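
A minimal sketch of that further analysis, assuming the dataset also has an `assigned_room_type` column (an assumption here; only `reserved_room_type` is used in the chart). The rows are invented for illustration:

```python
import pandas as pd

# Toy data; `assigned_room_type` is an assumed column name.
toy = pd.DataFrame({
    "reserved_room_type": ["A", "A", "A", "D", "D", "E"],
    "assigned_room_type": ["A", "D", "A", "D", "D", "F"],
})

# Flag bookings where the assigned room differs from the reserved one.
toy["room_mismatch"] = toy["reserved_room_type"] != toy["assigned_room_type"]

# Mismatch rate per reserved room type.
mismatch_rate = toy.groupby("reserved_room_type")["room_mismatch"].mean()

# Cross-tabulation shows which reassignments actually occur.
reassignments = pd.crosstab(toy["reserved_room_type"], toy["assigned_room_type"])
print(mismatch_rate)
print(reassignments)
```

Off-diagonal cells in `reassignments` are the bookings that received a different room type than the one reserved.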

##### 3. Will the gained insights help creating a positive business impact?


Yes, absolutely. Insights like this can:

* Help hotels optimize their room inventory according to demand

* Allow dynamic pricing on popular room types to boost revenue

* Improve marketing focus on underperforming room types

* Enhance customer satisfaction by aligning room supply with preferences

##### 4. Are there any insights that lead to negative growth? Justify with specific reason.

Yes.

* If the hotel consistently under-supplies high-demand room types, it leads to missed revenue opportunities and guest dissatisfaction.

* Also, under-booked room types consume maintenance and staffing costs without returns.

* This chart highlights potential misallocation of resources that can result in negative business growth if left unaddressed.

#### Chart 15 - Percentage Distribution of Bookings by Customer Type

In [None]:
# Chart 15 - Visualization code

# Features used :- customer_type

# Why this chart ?
# Customer type (Transient, Group, Contract, Transient-party) directly reflects
# the nature of guests staying at the hotel.
# Hotels can optimize pricing, marketing, and service strategies
# based on the dominant customer type.

# Calculating percentage distribution of customer types
customer_type_counts = df['customer_type'].value_counts(normalize=True) * 100
customer_type_percent = customer_type_counts.reset_index()
customer_type_percent.columns = ['Customer Type', 'Percentage']

# Creating the percentage bar chart
plt.figure(figsize=(8,6))
ax = sns.barplot(
    x='Customer Type',
    y='Percentage',
    data=customer_type_percent,
    palette='Set2'
)

# Adding title and labels
plt.title("Percentage Distribution of Bookings by Customer Type", fontsize=14, fontweight="bold")
plt.xlabel("Customer Type", fontsize=12)
plt.ylabel("Percentage of Bookings (%)", fontsize=12)
plt.xticks(rotation=30)

# Annotating bars with percentages
for p in ax.patches:
    ax.annotate(
        f'{p.get_height():.1f}%',
        (p.get_x() + p.get_width() / 2., p.get_height()),
        ha='center', va='bottom',
        fontsize=10, color='black', fontweight='bold'
    )

plt.show()

##### 1. Why did you pick the specific chart?

* I chose a percentage bar chart instead of a raw count chart because it gives a clearer view of the relative share of each customer type.
* This makes it easier to compare categories and prioritize business strategies accordingly.

##### 2. What is/are the insight(s) found from the chart?

* We can clearly see which customer type dominates hotel bookings (e.g., Transient customers often make up the majority, while Contract or Group bookings contribute much less).

##### 3. Will the gained insights help creating a positive business impact?


Yes.

- By knowing the dominant customer type, hotels can optimize marketing efforts and personalize offers.


- For example, if transient customers form ~75% of bookings, hotels should invest in dynamic pricing, targeted digital ads, and loyalty programs for solo travelers/couples.

##### 4. Are there any insights that lead to negative growth? Justify with specific reason.

Yes.

- If one segment (like groups or contracts) contributes very little, it shows underutilized business opportunities.

- For instance, low group bookings may highlight poor partnerships with corporates or travel agencies.

- This insight, while negative, helps management identify untapped revenue streams.

#### Chart 16 - Percentage Distribution of Repeated vs Non-Repeated Guests (is_repeated_guest)

In [None]:
# Chart 16 - Visualization code

# Features used :- is_repeated_guest

# Why this chart?
# Repeat guests are critical for business as they indicate customer loyalty
# and satisfaction. A simple binary bar chart (with percentages) makes
# retention trends easy to interpret and present.

# Mapping binary values to more readable labels
df['is_repeated_label'] = df['is_repeated_guest'].map({0: 'Non-Repeated Guests', 1: 'Repeated Guests'})

# Calculating percentage distribution
repeat_counts = df['is_repeated_label'].value_counts(normalize=True) * 100
repeat_percent = repeat_counts.reset_index()
repeat_percent.columns = ['Guest Type', 'Percentage']

# Creating percentage bar chart
plt.figure(figsize=(7,6))
ax = sns.barplot(
    x='Guest Type',
    y='Percentage',
    data=repeat_percent,
    palette='coolwarm'
)

# Adding title and labels
plt.title("Percentage Distribution of Repeated vs Non-Repeated Guests", fontsize=14, fontweight="bold")
plt.xlabel("Guest Type", fontsize=12)
plt.ylabel("Percentage of Bookings (%)", fontsize=12)

# Annotating bars with percentages
for p in ax.patches:
    ax.annotate(
        f'{p.get_height():.1f}%',
        (p.get_x() + p.get_width() / 2., p.get_height()),
        ha='center', va='bottom',
        fontsize=10, color='black', fontweight='bold'
    )

plt.show()


##### 1. Why did you pick the specific chart?

* I chose a percentage bar chart because it makes the comparison between repeated and non-repeated guests clear, while showing the proportion rather than just raw counts.

* Percentages give a business-oriented understanding of loyalty rates.

##### 2. What is/are the insight(s) found from the chart?

* Typically, non-repeated guests dominate, with repeated guests forming a very small percentage.

* This suggests that most customers book only once and do not return, which highlights an opportunity for improvement in retention strategies.

##### 3. Will the gained insights help creating a positive business impact?


Yes.

* By identifying the small proportion of repeat guests, hotel management can design loyalty programs, discounts, or personalized offers to increase customer retention.

* Even a small improvement in repeat guest percentage can significantly improve long-term revenue.

##### 4. Are there any insights that lead to negative growth? Justify with specific reason.

Yes.

* The insight that the majority of guests are non-repeated is a negative growth factor because it shows low loyalty and high dependency on constant new customer acquisition, which increases marketing costs.

* However, this negative insight provides a strategic direction to build retention strategies.

#### Chart 17 - Distribution of Total Stay Duration

In [None]:
# Chart 17 - Visualization code

#Features used :- total_stays

# Why this chart ?
# Hotels care a lot about this because longer stays mean higher revenue
# per booking but also risk of cancellations if too long.
# From an EDA perspective, this feature helps in spotting unusual values
# (e.g., guests staying 60+ nights → corporate bookings or data anomalies).

# Creating histogram with KDE to show the distribution
plt.figure(figsize=(10,6))
sns.histplot(df['total_stays'], bins=30, kde=True, color='teal')

# Adding labels, title, and styling
plt.title("Distribution of Total Stay Duration", fontsize=16, pad=15)
plt.xlabel("Total Nights Stayed", fontsize=14, labelpad=10)
plt.ylabel("Number of Bookings", fontsize=14, labelpad=10)
plt.grid(axis='y', linestyle='--', alpha=0.7)

# Showing the chart
plt.show()


##### 1. Why did you pick the specific chart?

* I used a histogram with KDE because total_stays is a discrete numerical variable (number of nights) with a wide range of values, so binning summarizes it better than one bar per value.

* A histogram clearly shows the frequency distribution of different stay durations, while KDE smoothens the curve to highlight trends (short vs long stays).

##### 2. What is/are the insight(s) found from the chart?

* Most bookings are for short stays (1-7 nights).

* Very few customers book stays longer than 15-20 nights.

* The distribution is highly right-skewed, meaning outliers exist where some guests stayed for extremely long durations (possibly monthly stays or anomalies).
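
The skew and the outliers can also be checked numerically rather than read off the chart. The sketch below uses made-up stay lengths in place of `df['total_stays']` and the common 1.5 × IQR rule, which is one convention among several for flagging unusually long stays:

```python
import pandas as pd

# Made-up stay lengths in nights, standing in for df['total_stays'].
stays = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5, 7, 7, 14, 30, 60])

# Positive skewness indicates a longer right tail.
skewness = stays.skew()

# Upper fence from the 1.5 * IQR rule.
q1, q3 = stays.quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)

# Stays beyond the fence are candidates for review (long-term guests or data errors).
long_stays = stays[stays > upper_fence]
print(f"skewness={skewness:.2f}, upper fence={upper_fence} nights")
print(long_stays.tolist())
```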

##### 3. Will the gained insights help creating a positive business impact?


Yes

* Hotels can design packages (weekend offers, 1-week discounts) since most stays are short-term.

* By understanding the average stay duration, hotels can optimize inventory and pricing strategies.

##### 4. Are there any insights that lead to negative growth? Justify with specific reason.

Potentially yes.

* Very long stays (30+ nights) may block room availability for higher-paying short-term guests, leading to lost revenue opportunities.

* Hotels may need to review such cases (are they corporate bulk bookings or system errors?) to avoid misallocation of resources.

#### Chart 18 - Distribution of Revenue

In [None]:
# Chart 18 - Visualization code

#Features used :- revenue

# Why this chart ?
# Revenue is the most critical business metric → it directly reflects hotel profitability.
# Visualizing its distribution highlights how much revenue most bookings generate,
# and whether revenue is concentrated in a small number of bookings (e.g., luxury packages) or
# more evenly spread.

# Note -
# Revenue distribution is usually heavily right-skewed
# (a few very large bookings can distort the view).
# Hence the histogram uses a log-scaled count axis (log=True);
# otherwise the sparse high-revenue bins would be barely visible.


# Cleaning data
# Removing records where revenue is zero or negative (invalid/technical entries)
revenue_data = df[df['revenue'] > 0]['revenue']

# Creating histogram
plt.figure(figsize=(10,6))
plt.hist(revenue_data, bins=100, color="skyblue", edgecolor="black", alpha=0.7, log=True)

# Chart formatting
plt.title("Revenue Distribution per Booking", fontsize=16, pad=15)
plt.xlabel("Revenue", fontsize=14, labelpad=10)  # x-axis is linear; log=True only log-scales the counts
plt.ylabel("Number of Bookings (log scale)", fontsize=14, labelpad=10)
plt.grid(axis='y', linestyle='--', alpha=0.7)

plt.show()


##### 1. Why did you pick the specific chart?

* I chose a log-scaled histogram because revenue data is typically highly skewed (most bookings bring in small-to-moderate revenue, while a few high-value bookings generate extremely large revenue).


* Using log scaling ensures both low-value and high-value bookings are visible, preventing distortion.
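
Passing `log=True` to `plt.hist` log-scales only the count axis. If a log-scaled revenue axis is wanted as well, one option is to bin with logarithmically spaced edges. The sketch below uses synthetic log-normal values rather than the real `revenue` column, and the bin count of 30 is an arbitrary choice:

```python
import numpy as np

# Synthetic right-skewed "revenue" values (log-normal); not real data.
rng = np.random.default_rng(0)
revenue = rng.lognormal(mean=5.0, sigma=1.0, size=1_000)

# 31 edges (30 bins) with a constant ratio between neighbours, spanning the data range.
edges = np.geomspace(revenue.min(), revenue.max(), num=31)
counts, _ = np.histogram(revenue, bins=edges)

# To draw it: plt.hist(revenue, bins=edges) followed by plt.xscale('log').
print(counts.sum(), "values binned into", len(edges) - 1, "log-spaced bins")
```

With log-spaced bins each bar covers the same width on a log axis, so low-revenue bookings are not squeezed into the first few bins.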

##### 2. What is/are the insight(s) found from the chart?

* The majority of bookings fall into low-to-moderate revenue ranges.

* A small proportion of bookings contribute very high revenue, creating a long tail distribution.

* This indicates that hotels may rely heavily on a small group of premium customers or large group bookings for significant revenue share.
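
The long-tail claim above can be backed with a number instead of being read off the histogram. The sketch below computes the share of total revenue that comes from the top 10% of bookings. It is only an illustration on made-up values: `top_revenue_share` is a hypothetical helper (not part of this notebook), and in the notebook the input would be `df['revenue']`.

```python
import pandas as pd

def top_revenue_share(revenue: pd.Series, top_fraction: float = 0.10) -> float:
    """Share of total revenue contributed by the highest-revenue bookings."""
    # Keep only valid (positive) revenue, sorted from largest to smallest
    positive = revenue[revenue > 0].sort_values(ascending=False)
    # Number of bookings in the top slice (at least one)
    n_top = max(1, int(round(len(positive) * top_fraction)))
    return positive.iloc[:n_top].sum() / positive.sum()

# Toy data (illustrative only, not taken from the hotel dataset)
toy_revenue = pd.Series([50, 60, 70, 80, 90, 100, 110, 120, 130, 1190])
share = top_revenue_share(toy_revenue, top_fraction=0.10)
print(f"Top 10% of bookings contribute {share:.1%} of revenue")
```

A high share here would support the dependence on a few large bookings; a share close to 10% would mean revenue is spread fairly evenly.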

##### 3. Will the gained insights help creating a positive business impact?


Yes. These insights can guide:

* **Revenue strategy:** Hotels can focus marketing on high-value customer segments to maximize profit.

* **Dynamic pricing:** Identifying ranges where most customers fall helps refine ADR pricing strategy.

* **Promotions:** For low-revenue groups, targeted upselling (room upgrades, meals, packages) can boost profitability

##### 4. Are there any insights that lead to negative growth? Justify with specific reason.

Yes. If too much dependence is on a few high-value bookings, it creates revenue risk:

* If these customers stop booking, revenue may drop disproportionately.

* Heavy reliance on discounts to attract smaller bookings might also reduce average profitability.

* This means hotels must balance their customer mix — attract more mid-range customers while retaining premium ones.

###  Bivariate Analysis

#### Chart 19 - Distribution of ADR by Hotel Type

In [None]:
# Chart 19 - Visualization code

# Features used :- adr, hotel

# Why this chart ?
# ADR is directly tied to revenue and pricing strategy.
# Comparing ADR across hotel types (City Hotel vs. Resort Hotel) reveals how each segment is priced
# and which category drives higher revenue per night.
# It will show whether one hotel type consistently charges more and how wide the variation is.
# These insights can guide revenue management teams in adjusting pricing strategies,
# offering discounts, or upselling to maximize occupancy and profit.

# Removing extreme ADRs for clean shape
df_violin = df[df['adr'] < 500]  # Filtering out extreme outliers

# Setting plot size
plt.figure(figsize=(8, 6))

# Creating violin plot
sns.violinplot(
    data=df_violin,
    x='hotel',
    y='adr',
    palette='muted',
    inner='box',  # also show a mini boxplot inside
    scale='count'  # width based on number of observations
)

# Labels and title
plt.title('ADR Distribution by Hotel Type', fontsize=14)
plt.xlabel('Hotel Type', fontsize=12)
plt.ylabel('Average Daily Rate (ADR)', fontsize=12)
plt.grid(axis='y', linestyle='--', alpha=0.4)
plt.tight_layout()

# Showing plot
plt.show()


##### 1. Why did you pick the specific chart?

- A violin plot combines the strengths of box plots and KDE plots.

- It shows where most bookings are priced, how prices differ by hotel type, and how consistent the pricing strategy is.


##### 2. What is/are the insight(s) found from the chart?

* City hotels have a wider price spread, with more high-priced bookings (even above €200).

* Resort hotels tend to have more clustered pricing, mostly below €150.

* The City Hotel's violin is fatter, showing more bookings overall and greater pricing variation.

This indicates City Hotels might be applying dynamic pricing, while Resort Hotels may use stable rate packages.



##### 3. Will the gained insights help creating a positive business impact?

Yes:

* The business can analyze rate volatility and customer sensitivity by hotel type

* Helps refine pricing strategy — e.g., consider more dynamic pricing for resorts during peak seasons

* Assists in revenue prediction modeling based on property type

Hotels can better align their pricing models with customer expectations and booking patterns.

##### 4. Are there any insights that lead to negative growth? Justify with specific reason.

Yes:

* Too much pricing variation in city hotels may confuse customers or lead to dissatisfaction.

* Resort hotels may miss out on revenue by keeping ADRs too flat even when demand surges.

* Over-discounting city hotel rooms might devalue the brand or lead to profit loss during low seasons.

Insights like this help prevent pricing missteps and optimize strategy per hotel type.

#### Chart 20 - Repeated Guest vs Cancellation Rate

In [None]:
# Chart 20 - Visualisation Code

# Features used :- is_repeated_guest, is_canceled

# Why this chart ?
# This chart answers:
#     * Are repeat guests more reliable in terms of cancellations?
#     * Does guest loyalty impact booking retention?
# This is an extremely powerful metric for understanding
# customer behavior and strategizing loyalty programs

# Creating a grouped DataFrame
repeat_cancel = df.groupby(['is_repeated_guest', 'is_canceled']).size().unstack()

# Normalizing to percentage
repeat_cancel_percent = repeat_cancel.div(repeat_cancel.sum(axis=1), axis=0) * 100

# Plotting grouped bar chart
ax = repeat_cancel_percent.plot(
    kind='bar',
    figsize=(8, 6),
    color=['mediumseagreen', 'salmon']
)

# Styling
plt.title('Cancellation Rate by Guest Type (Repeat vs New)', fontsize=15, weight='bold')
plt.xlabel('Is Repeated Guest', fontsize=12)
plt.ylabel('Percentage (%)', fontsize=12)
plt.xticks(ticks=[0, 1], labels=['New Guest', 'Repeat Guest'], rotation=0)
plt.legend(['Not Canceled', 'Canceled'], title='Booking Status')
plt.grid(axis='y', linestyle='--', alpha=0.3)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

* It gives deep behavioral insight about guest loyalty and booking trustworthiness.

* Hotels invest heavily in loyalty programs — this helps justify or tweak those efforts.

* A grouped percentage bar plot is clean and allows easy comparison.

##### 2. What is/are the insight(s) found from the chart?


* Repeat guests rarely cancel — cancellation rate is often under 5-10%.

* New guests are far more likely to cancel, sometimes over 40-50%.

* Loyalty equals reliability — repeat guests are consistent and predictable.
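
To state the cancellation rates as exact figures rather than approximate ranges, the same percentages that the bar chart plots can be printed as a table. This is a minimal sketch on toy data (the values below are made up); `pd.crosstab` with `normalize="index"` gives row-wise shares directly.

```python
import pandas as pd

# Toy data (illustrative only): 4 new guests and 4 repeat guests
toy = pd.DataFrame({
    "is_repeated_guest": [0, 0, 0, 0, 1, 1, 1, 1],
    "is_canceled":       [1, 1, 0, 0, 0, 0, 0, 1],
})

# Row-normalised crosstab: each row sums to 100%
rates = pd.crosstab(
    toy["is_repeated_guest"], toy["is_canceled"], normalize="index"
) * 100
rates = rates.rename(
    index={0: "New Guest", 1: "Repeat Guest"},
    columns={0: "not_canceled_pct", 1: "canceled_pct"},
)
print(rates.round(1))
```

Applied to `df`, the `canceled_pct` column holds the numbers to quote in the insights.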

##### 3. Will the gained insights help creating a positive business impact?


Absolutely:

* Hotels should prioritize nurturing repeat guests — they are lower risk, cheaper to retain, and more reliable.

* Encourage first-time guests to sign up for loyalty programs to convert them.

* Helps CRM and marketing teams identify churn-risk profiles early.

##### 4. Are there any insights that lead to negative growth? Justify with specific reason.


Yes:

* Ignoring new guest behavior may result in unpredictable cancellations, hurting occupancy planning.

* If hotels don't try to convert new → repeat, they miss out on the most stable customer base.

Hotels should:

* Implement early onboarding offers to reduce cancellations from new guests

* Use email follow-ups post-stay to pull them into loyalty loop

#### Chart 21 - Lead Time vs ADR

In [None]:
# Chart 21 - Visualization code

# Features used :- adr, lead_time

# Why this chart ?
# This chart shows the relationship between:
#   lead_time - how far in advance the booking was made
#   adr - price paid per night
# This chart is insightful for revenue managers and pricing teams —
# it tells whether last-minute bookings are cheaper, or if early bookers pay more.

# Filtering extreme values for clarity
filtered_df = df[(df['adr'] < 500) & (df['lead_time'] < 500)]

# Setting plot style
sns.set(style='whitegrid')

# Setting figure size
plt.figure(figsize=(10, 6))

# Creating scatter plot
sns.scatterplot(
    data=filtered_df,
    x='lead_time',
    y='adr',
    alpha=0.4,
    edgecolor=None,
    s=30,
    color='royalblue'
)

# Adding regression trend line (optional)
sns.regplot(
    data=filtered_df,
    x='lead_time',
    y='adr',
    scatter=False,
    color='darkred',
    line_kws={'linewidth': 2}
)

# Labels and title
plt.title('Relationship Between Lead Time and ADR', fontsize=15, weight='bold')
plt.xlabel('Lead Time (Days Before Check-in)', fontsize=12)
plt.ylabel('ADR (Average Daily Rate in EUR)', fontsize=12)

# Enhancing aesthetics
plt.xticks(fontsize=10)
plt.yticks(fontsize=10)
plt.grid(True, linestyle='--', alpha=0.3)
plt.tight_layout()

# Showing plot
plt.show()


##### 1. Why did you pick the specific chart?

**Answer -**

* A scatter plot with regression line shows price sensitivity over time.

* It's a classic bivariate analysis — used by revenue analysts to optimize room pricing.

* It adds analytical depth and moves beyond simple distributions.

##### 2. What is/are the insight(s) found from the chart?

**Answer -**

* There's no strong linear relationship, but:

    * Some high-ADR bookings occur at very short lead times — possibly last-minute premium bookings.

    * The majority of lower-priced bookings cluster at short-to-medium lead times (0-100 days).

* Suggests that pricing is not aggressively time-based, or affected more by season/channel.
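
The "no strong linear relationship" reading can be checked with correlation coefficients. The sketch below is an assumption-laden illustration on toy values, not a result from this dataset: it computes Pearson correlation (linear association) and Spearman correlation (monotonic association, obtained here as the Pearson correlation of the ranks).

```python
import pandas as pd

# Toy data (illustrative only)
toy = pd.DataFrame({
    "lead_time": [0, 10, 30, 60, 120, 240],
    "adr":       [120.0, 95.0, 110.0, 90.0, 100.0, 85.0],
})

# Pearson: strength of the linear relationship
pearson = toy["lead_time"].corr(toy["adr"])

# Spearman: Pearson correlation computed on the ranks
spearman = toy["lead_time"].rank().corr(toy["adr"].rank())

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

On the real data, values near zero for both would agree with the flat regression line, while a clearly non-zero Spearman value with a weak Pearson value would point to a non-linear trend.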

##### 3. Will the gained insights help creating a positive business impact?


**Answer -**

Yes:

* Shows whether guests booking earlier pay more or less.

* Can lead to differentiated pricing strategies for early birds vs. last-minute guests.

* Helps in forecasting revenue by identifying trends.

##### 4. Are there any insights that lead to negative growth? Justify with specific reason.

**Answer -**

Yes:

* If there's no clear pricing advantage for early booking, guests might delay decisions, leading to higher cancellation risks.

* Unpredictable pricing makes it hard to optimize dynamic pricing engines.

Hotels can consider:

* Offering discounts for early planners

* Creating urgency for late-bookers with flash pricing

#### Chart 22 - Customer Type vs ADR

In [None]:
# Chart 22 - Visualization code

# Features used - customer_type, adr

# Why this chart ?
# Helps analyze how much each customer segment pays per night.
# ADR is a per-night price, so the chart shows which segments pay the highest rates,
# which is key for targeting, discounting, and contract negotiation.

# Filtering out high outlier prices for cleaner plot
df_box = df[df['adr'] < 500]

# Setting the style
sns.set(style='whitegrid')

# Creating the figure
plt.figure(figsize=(9, 6))

# Drawing the box plot
sns.boxplot(
    data=df_box,
    x='customer_type',
    y='adr',
    palette='pastel'
)

# Adding title and labels
plt.title('ADR Distribution by Customer Type', fontsize=15, weight='bold')
plt.xlabel('Customer Type', fontsize=12)
plt.ylabel('ADR (Average Daily Rate in EUR)', fontsize=12)

# Improving aesthetics
plt.xticks(fontsize=10)
plt.yticks(fontsize=10)
plt.grid(axis='y', linestyle='--', alpha=0.3)
plt.tight_layout()

# Showing the plot
plt.show()

##### 1. Why did you pick the specific chart?

**Answer -**

* It clearly shows how each customer segment behaves in terms of spending.

* Box plots are excellent for comparing distributions across categories.

* Helps answer: Which segment pays more? Who needs upselling? Who books in bulk but cheap?

##### 2. What is/are the insight(s) found from the chart?

**Answer -**

* Transient guests have the widest ADR range — from budget to premium.

* Contract customers usually pay lower fixed rates, as expected.

* Group bookings tend to have lower ADRs (volume discounts).

* Transient-party shows a slightly higher median than Transient — possibly families booking premium options.
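
The medians compared above can be read more reliably from a small summary table than from the box plot. This is a minimal sketch with made-up values; in the notebook the same `groupby` would run on `df_box`.

```python
import pandas as pd

# Toy data (illustrative only)
toy = pd.DataFrame({
    "customer_type": ["Transient", "Transient", "Transient",
                      "Contract", "Contract",
                      "Group",
                      "Transient-Party", "Transient-Party"],
    "adr": [80.0, 120.0, 160.0, 70.0, 90.0, 60.0, 100.0, 110.0],
})

# Booking count and median ADR per customer type, highest median first
adr_summary = (
    toy.groupby("customer_type")["adr"]
       .agg(bookings="count", median_adr="median")
       .sort_values("median_adr", ascending=False)
)
print(adr_summary)
```

The `bookings` column matters as much as the median: a segment with very few bookings can show a high or low median by chance.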

##### 3. Will the gained insights help creating a positive business impact?

**Answer -**

Yes:

* Identifies profitable segments (e.g., Transient, Transient-party).

* Helps tailor discount policies for group and contract customers.

* Can inform targeted campaigns (e.g., upselling to Transient-party guests).

* Sales and marketing teams can focus more effort where margin per guest is highest.

##### 4. Are there any insights that lead to negative growth? Justify with specific reason.

**Answer -**

Yes:

* If too much business comes from Contract or Group customers, ADRs might be pulled down, reducing revenue.

* Over-reliance on low-ADR segments without upselling leads to low-margin operations.

The hotel should explore value-adds and tiered pricing to increase profitability from lower-paying segments.

#### Chart 23 - Cancellation Rate by Market Segment

In [None]:
# Chart 23 - Visualisation Code.

# Features used :- is_canceled, market_segment

# Why this chart ?
# Exploring the relationship between :
#   market_segment - how customers found the hotel (Online, Offline, Corporate, Groups, etc.)
#   is_canceled - whether they canceled the booking
# will show cancellation rates per segment — this is much more useful for strategy teams

# Creating a data for visualisation.
cancellation_rates = round((df.groupby('market_segment')['is_canceled'].mean().sort_values(ascending=False) * 100),2)


# Setting values for X and Y axes.
X = cancellation_rates.index
Y = cancellation_rates.values

# Creating and characterising layout.
fig, ax = plt.subplots(figsize=(10,6))

# Customising grid and axes.
# (Spine hiding is left disabled; uncomment the next two lines to remove the top and right borders.)
# ax.spines['top'].set_visible(False)
# ax.spines['right'].set_visible(False)
ax.tick_params(bottom=False, left=False)
ax.set_axisbelow(True)
ax.yaxis.grid(True, color='#C5C9C7')
ax.xaxis.grid(False)

# Adding bar and text for labels on bar in graph.
bar = ax.bar(X, Y)
plt.title('Cancellation Rate By Market Segment', pad=20, fontsize=15)
ax.set_xlabel("Market Segment", fontsize = 15, labelpad = 15)
ax.set_ylabel("Cancellation Rate (%)", fontsize = 15, labelpad = 15)
ax.tick_params(axis = 'x', labelrotation = 20)


# Creating a function for gradient bars.
def gradientbars(bars, cmap):
    ax = bars[0].axes
    # Keeping the current x-limits and fixing the y-axis to 0-120 so labels fit above the bars
    lim = ax.get_xlim() + (0, 120)
    # Vertical gradient image reused for every bar
    grad = np.atleast_2d(np.linspace(0, 1, 256)).T
    for i, bar in enumerate(bars):
        bar.set_zorder(1)
        bar.set_facecolor('none')
        x, y = bar.get_xy()
        w, h = bar.get_width(), bar.get_height()
        # Using the colormap passed by the caller instead of a hardcoded one
        ax.imshow(grad, extent=[x+w, x, y, y+h], aspect='auto', zorder=1, cmap=cmap)
        ax.annotate(f"{Y[i]}%", (x+0.1, y+h+0.7), fontsize=12)
    ax.axis(lim)

gradientbars(bar, 'winter')

# Showing the plot
plt.show()


##### 1. Why did you pick the specific chart?

**Answer -**

* Cancellations are a major revenue leak, and knowing which channels cause them helps in strategy decisions.

* This is a bivariate business-critical analysis — not just numbers, but rates.

* It's a clean, professional chart often used in real-world hotel analytics.

##### 2. What is/are the insight(s) found from the chart?

**Answer -**

* Online TA (Travel Agents) have the highest cancellation rate (often over 40-50%).

* Corporate and Offline segments have much lower cancellation rates, suggesting more stable bookings.

* Group bookings have moderate cancellations — perhaps due to event changes or coordination issues.
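
A cancellation rate on its own can mislead when a segment has only a handful of bookings. As a hedged sketch on toy data, the rate can be reported next to the booking count, with small segments filtered out before drawing conclusions. `MIN_BOOKINGS` is a hypothetical threshold chosen for the illustration, not a value from this notebook.

```python
import pandas as pd

# Toy data (illustrative only); "Small Segment" has a single booking
toy = pd.DataFrame({
    "market_segment": ["Online TA"] * 4 + ["Corporate"] * 4 + ["Small Segment"],
    "is_canceled":    [1, 1, 0, 0,        0, 0, 0, 1,        1],
})

# Cancellation rate and number of bookings per segment
segment_stats = (
    toy.groupby("market_segment")["is_canceled"]
       .agg(cancel_rate="mean", bookings="count")
)
segment_stats["cancel_rate_pct"] = (segment_stats["cancel_rate"] * 100).round(2)

# Hypothetical minimum sample size for a rate to be considered reliable
MIN_BOOKINGS = 3
reliable = segment_stats[segment_stats["bookings"] >= MIN_BOOKINGS]
print(reliable[["bookings", "cancel_rate_pct"]])
```

In the toy data, "Small Segment" has a 100% rate from one booking, which is exactly the kind of bar that should not drive policy.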

##### 3. Will the gained insights help creating a positive business impact?

**Answer -**

Definitely:

* Push for stricter deposit policies or non-refundable rates in high-cancel segments (like OTA bookings).

* Focus marketing on more stable channels like corporate and offline.

* Helps revenue management teams optimize overbooking strategies.

##### 4. Are there any insights that lead to negative growth? Justify with specific reason.

**Answer -**

Yes:

* Over-reliance on OTA channels with high cancellation rates increases revenue unpredictability.

* May lead to wasted inventory and last-minute unsold rooms.

* Group cancellations, if unmanaged, can disrupt operations or block valuable dates.

This chart empowers teams to proactively reduce cancellation risks.

#### Chart 24 - Assigned Room Type vs Reserved Room Type

In [None]:
# Chart 24 - Visualization Code

# Features used : reserved_room_type, assigned_room_type

# Why this chart ?
# This is a very interesting operational insight:
# Do guests actually get the room type they reserved, or are they often upgraded/downgraded?
# This impacts:
#     * Customer satisfaction
#     * Room inventory planning
#     * Upgrade policies
# I will use a heatmap of crosstab counts —
# it’s visually unique and perfect for comparing categorical pairings


# Creating crosstab between reserved and assigned room type
room_match = pd.crosstab(df['reserved_room_type'], df['assigned_room_type'])

# Setting up the figure
plt.figure(figsize=(10, 7))

# Plotting the heatmap
sns.heatmap(
    room_match,
    annot=True,
    fmt='d',
    cmap='YlGnBu',
    linewidths=0.5,
    linecolor='gray'
)

# Adding labels
plt.title('Reserved Room Type vs Assigned Room Type', fontsize=15, weight='bold')
plt.xlabel('Assigned Room Type', fontsize=12)
plt.ylabel('Reserved Room Type', fontsize=12)
plt.tight_layout()

# Showing the plot
plt.show()


##### 1. Why did you pick the specific chart?

**Answer -**

* This chart provides a unique operational insight: how often hotels change room types.

* A heatmap of crosstab counts makes a categorical-categorical comparison readable at a glance, and the annotated cells give the exact count for every reserved/assigned pair.

* Helps visually detect patterns — like frequent upgrades from one room to another.

##### 2. What is/are the insight(s) found from the chart?

**Answer -**

* Most guests receive the same room type they reserved (diagonal cells will have highest values).

* However, there may be frequent upgrades/downgrades from some room types (e.g., Reserved A but Assigned D).

* Certain types (e.g., G, L) may rarely be assigned — these may be premium rooms held back for special guests.
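
The statement that most guests receive the room they reserved can be turned into a single number, and the heatmap can be row-normalised so that rare room types are not drowned out by the most common one. This is a minimal sketch on toy data. Note that a mismatch only shows a reassignment: whether it is an upgrade or a downgrade depends on how the room-type letters are ranked, which is an assumption the data alone does not settle.

```python
import pandas as pd

# Toy data (illustrative only)
toy = pd.DataFrame({
    "reserved_room_type": ["A", "A", "A", "A", "D", "D", "E", "E"],
    "assigned_room_type": ["A", "A", "D", "A", "D", "D", "E", "F"],
})

# Overall share of bookings where assigned == reserved
match_rate = (toy["reserved_room_type"] == toy["assigned_room_type"]).mean() * 100

# Row-normalised crosstab: for each reserved type, % assigned to each type
row_pct = pd.crosstab(
    toy["reserved_room_type"], toy["assigned_room_type"], normalize="index"
) * 100

print(f"Overall match rate: {match_rate:.1f}%")
print(row_pct.round(1))
```

Passing `row_pct` to `sns.heatmap` with `fmt='.1f'` instead of the raw counts would show the reassignment share per reserved type.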

##### 3. Will the gained insights help creating a positive business impact?

**Answer -**

Yes:

* Helps optimize room inventory planning — knowing which rooms get overbooked or underutilized.

* Enhances guest satisfaction tracking — guests expecting one room but getting another may leave poor reviews.

* Can support pricing decisions — if guests are often upgraded from cheaper rooms, adjust rates or stock.

##### 4. Are there any insights that lead to negative growth? Justify with specific reason.

**Answer -**

Yes:

* Frequent mismatches could indicate poor planning or overbooking — leading to guest dissatisfaction.

* If higher-category rooms are routinely assigned to lower-paying guests, revenue is lost.

* Inconsistent room allocation may erode trust in the booking process.

Hotels can revisit their upgrade policies and forecast demand more accurately using this data.

#### Chart 25 - Stays in Weekend Nights vs Week Nights

In [None]:
# Chart 25 - Visualization Code

# Features used :- stays_in_week_nights, stays_in_weekend_nights

# Why this chart ?
# It highlights customer behavior patterns —
# whether most guests prefer weekday stays (business travelers, corporate bookings)
# or weekend stays (leisure travelers, families).
# Hotels can use this insight for dynamic pricing, staffing decisions,
# and targeted promotions (e.g., discounts for off-peak days).


# Labels and values
labels = ['Stays in Week Nights', 'Stays in Weekend Nights']
count = [df['stays_in_week_nights'].sum(), df['stays_in_weekend_nights'].sum()]

# Plot setup
fig, ax = plt.subplots(figsize=(9, 5))

# Donut chart
wedges, texts, autotexts = ax.pie(
    count,
    autopct='%1.1f%%',
    textprops={'fontsize': 12},
    colors=['cornflowerblue', 'salmon'],
    pctdistance=0.7,
    startangle=90,
    wedgeprops=dict(width=0.5)
)

# Title and legend
ax.set_title('Stays in Weekend Nights Vs Week Nights', fontsize=14, fontweight='bold')
ax.legend(wedges, labels, title="Stay Type", loc="upper left", bbox_to_anchor=(0.8, 0, 0.5, 1))
ax.axis('equal')  # Ensures circle shape

plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

**Answer -**

* I chose a donut chart because it provides a clear visual proportion between weekend and weekday stays.

* This helps us understand customer travel patterns — whether people book stays for business or leisure purposes.

* A pie chart is ideal when showing parts of a whole.

##### 2. What is/are the insight(s) found from the chart?

**Answer -**

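* The majority of stay nights are week nights. Since a week has five week nights and only two weekend nights, part of this split is structural, so it hints at a business traveler segment rather than proving one.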

* Weekend stays, while lower, still form a sizable portion, pointing to a leisure audience as well.

* This split can help classify the guest base into business vs. leisure personas.
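
Because a week has five week nights and two weekend nights, roughly 71% of nights would be week nights even if guests had no preference at all. A quick way to read the donut chart against that baseline is sketched below. The totals are made up for the illustration; in the notebook they would be the two sums stored in `count`.

```python
# Toy totals (illustrative only)
week_nights_total = 300
weekend_nights_total = 100

# Observed share of nights that are week nights
observed_share = week_nights_total / (week_nights_total + weekend_nights_total)

# Share expected if nights were spread evenly over the 7 days of the week
baseline_share = 5 / 7

# Positive: week nights over-represented; negative: weekends over-represented
excess = observed_share - baseline_share

print(f"Observed week-night share: {observed_share:.1%}")
print(f"Calendar baseline:         {baseline_share:.1%}")
print(f"Difference:                {excess:+.1%}")
```

Only the difference from the baseline says something about business versus leisure demand.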

##### 3. Will the gained insights help creating a positive business impact?

**Answer -**

Yes. These insights are directly actionable:

* Hotels can optimize room pricing strategies depending on demand patterns

* Adjust staffing based on busy weekdays or weekends

* Plan offers and events more strategically

* Improve revenue forecasting by knowing when demand is highest

##### 4. Are there any insights that lead to negative growth? Justify with specific reason.

**Answer -**

Yes. If weekend stays are too low, it may indicate:

* Missed opportunities in the leisure market

* Ineffective weekend promotions or packages

* Lack of events, amenities, or attractions for non-business guests

This can lead to under-utilized rooms and lower weekend revenue, so targeted efforts must be made to boost weekend appeal.

#### Chart 26 - Total Number of Guests vs ADR

In [None]:
# Chart 26 - Visualisation Code

# Features used - total_guests, adr

# Why this chart ?
# This chart will tell:
#   * Do guests who come in bigger groups pay more per night?
#   * Is there any trend between group size and pricing?
# I will use a scatter plot with bubble sizes based on frequency to add more storytelling.


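# Filtering out extreme ADR for clean visuals
# .copy() makes df_bubble an independent DataFrame, so adding a column below
# does not trigger pandas' SettingWithCopyWarning
df_bubble = df[(df['adr'] < 500) & (df['total_guests'] <= 6)].copy()

# Counting frequency of each total_guests vs adr pair (rounded ADR for grouping)
df_bubble['adr_rounded'] = df_bubble['adr'].round()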

bubble_data = df_bubble.groupby(['total_guests', 'adr_rounded']).size().reset_index(name='count')

# Plotting setup
plt.figure(figsize=(10, 6))
scatter = plt.scatter(
    x=bubble_data['total_guests'],
    y=bubble_data['adr_rounded'],
    s=bubble_data['count'] * 3,  # Bubble size
    alpha=0.5,
    color='mediumseagreen',
    edgecolors='black',
    linewidth=0.5
)

# Labels and title
plt.title('Total Guests vs ADR (Bubble Plot)', fontsize=15, weight='bold')
plt.xlabel('Total Number of Guests', fontsize=12)
plt.ylabel('ADR (Rounded)', fontsize=12)
plt.grid(True, linestyle='--', alpha=0.3)
plt.tight_layout()


# Showing the Plot
plt.show()


##### 1. Why did you pick the specific chart?

**Answer -**

* Combines numerical-to-numerical relationship (group size vs pricing) with bubble size as frequency.

* Bubble size encodes how many bookings share each (guests, ADR) pair, which avoids the overplotting a plain scatter plot would suffer from with discrete guest counts.

* Highlights guest booking patterns that can inform room design, pricing, and promotions.

##### 2. What is/are the insight(s) found from the chart?

**Answer -**

* Most bookings cluster at 2 guests with ADR around €100 — likely couples.

* Larger guest groups (4-5+) do book, but are less frequent.

* Surprisingly, many larger guest bookings don’t necessarily correlate with higher ADR — could indicate value bundles or family discounts.
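
Whether larger groups really pay less per person can be checked by dividing ADR by the number of guests. The sketch below uses toy values and a hypothetical `adr_per_guest` column that does not exist in this notebook. Bookings with zero guests are dropped first to avoid dividing by zero.

```python
import pandas as pd

# Toy data (illustrative only)
toy = pd.DataFrame({
    "total_guests": [1, 1, 2, 2, 2, 3, 4],
    "adr":          [80.0, 90.0, 100.0, 110.0, 120.0, 135.0, 160.0],
})

# Guarding against division by zero for bookings recorded with no guests
toy = toy[toy["total_guests"] > 0].copy()

# ADR per guest (hypothetical helper column for this illustration)
toy["adr_per_guest"] = toy["adr"] / toy["total_guests"]

# Median room rate and median per-guest rate by group size
per_size = toy.groupby("total_guests")[["adr", "adr_per_guest"]].median()
print(per_size)
```

In this toy example the room rate rises with group size while the per-guest rate falls, which is the pattern the over-discounting concern refers to.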

##### 3. Will the gained insights help creating a positive business impact?

**Answer -**

Yes:

* Hotels can develop targeted offers for high-frequency booking sizes (e.g., romantic couples, families).

* Identify underpriced larger bookings and optimize pricing or room segmentation.

##### 4. Are there any insights that lead to negative growth? Justify with specific reason.

**Answer -**

Yes:

* If larger groups consistently pay less per head, that might signal over-discounting.

* Failure to upsell group stays could mean lost revenue.

* Not enough facilities or tailored packages for mid-sized groups (3-5 guests) might be turning business away.

Hotels can adjust family packages and improve group targeting based on this.


#### Chart 27 - Lead Time vs Cancellation Rate

In [None]:
# Chart 27 - Visualisation Code

# Features used :- lead_time, is_canceled

# Why this chart ?
# Will analyze:
#     lead_time - days between booking date and arrival
#     is_canceled - 1 if canceled, 0 if not
# This is one of the most important behavioral insights for hotel revenue teams:
#     * Do guests who book earlier cancel more often?

# Setting the plot size and style
plt.figure(figsize=(8, 6))
sns.set(style='whitegrid')

# Creating boxplot
sns.boxplot(
    data=df,
    x='is_canceled',
    y='lead_time',
    palette='Set2'
)

# Adding titles and labels
plt.title('Lead Time vs Cancellation Status', fontsize=15, weight='bold')
plt.xlabel('Is Canceled (0 = No, 1 = Yes)', fontsize=12)
plt.ylabel('Lead Time (Days)', fontsize=12)
plt.grid(axis='y', linestyle='--', alpha=0.3)
plt.tight_layout()

# Showing the plot
plt.show()

##### 1. Why did you pick the specific chart?

**Answer -**

* Box plots are perfect for showing distribution shifts between groups.

* This chart visually reveals behavioral patterns around cancellations.

* Helps answer whether last-minute vs early bookings behave differently in terms of commitment.

##### 2. What is/are the insight(s) found from the chart?


**Answer -**

* Guests who canceled had much higher lead times on average.

* Guests with shorter lead times tend to follow through with their bookings.

* Cancellations are more common when bookings are made far in advance — possibly due to change of plans or lack of upfront commitment.
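
The box plot compares lead times between canceled and kept bookings. The reverse view, the cancellation rate within each lead-time band, answers the original question more directly. This is a minimal sketch on toy data; the band edges are an assumption chosen for the illustration.

```python
import pandas as pd

# Toy data (illustrative only)
toy = pd.DataFrame({
    "lead_time":   [0, 3, 5, 20, 25, 45, 100, 150, 200, 400],
    "is_canceled": [0, 0, 0, 0,  1,  0,  1,   1,   0,   1],
})

# Hypothetical lead-time bands in days (right edge inclusive)
bins = [0, 7, 30, 90, 180, 365, float("inf")]
labels = ["0-7", "8-30", "31-90", "91-180", "181-365", "366+"]
toy["lead_time_band"] = pd.cut(
    toy["lead_time"], bins=bins, labels=labels, include_lowest=True
)

# Cancellation rate (%) per band
band_rates = (
    toy.groupby("lead_time_band", observed=True)["is_canceled"].mean() * 100
)
print(band_rates)
```

On the real data, a rate that climbs from band to band would support deposits or stricter terms for long-lead bookings.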

##### 3. Will the gained insights help creating a positive business impact?


**Answer -**

Absolutely:

* Suggests enforcing stricter cancellation policies or deposits for long lead time bookings.

* Hotels can adjust overbooking strategies based on lead time profiles.

* Inform targeted confirmation/reminder emails for early bookers to reduce churn.

##### 4. Are there any insights that lead to negative growth? Justify with specific reason.


**Answer -**

Yes:

* If long-lead bookings dominate and many cancel, hotels suffer inventory lock-in, forecasting issues, and revenue loss.

* Too lenient policies for early bookings could encourage speculative behavior.

A proactive strategy would be to:

* Add non-refundable options with early booking discounts

* Offer early confirmation incentives

#### Chart 28 - Reservation Status vs ADR

In [None]:
# Chart 28 - Visualisation Code

# Features used - adr, reservation_status

# Why this chart ?
# This chart answers:
#     * How does pricing differ across final reservation outcomes?
#     * Are canceled or no-show bookings usually higher-priced?
# I will use a violin plot to show both the distribution shape and
# spread of prices across outcomes. It gives a very refined look
# at revenue risk by status.

# Removing extreme ADR outliers for clarity
df_violin_status = df[df['adr'] < 500]

# Setting style and figure
plt.figure(figsize=(9, 6))
sns.set(style='whitegrid')

# Creating violin plot
sns.violinplot(
    data=df_violin_status,
    x='reservation_status',
    y='adr',
    inner='box',
    palette='Accent'
)

# Adding title and axis labels
plt.title('ADR Distribution by Reservation Status', fontsize=15, weight='bold')
plt.xlabel('Reservation Status', fontsize=12)
plt.ylabel('ADR (EUR)', fontsize=12)
plt.grid(axis='y', linestyle='--', alpha=0.3)
plt.tight_layout()

# Showing plot
plt.show()


##### 1. Why did you pick the specific chart?

**Answer -**

* This chart beautifully shows pricing spread and density for each reservation outcome.

* The violin's density shape shows where prices concentrate within each status, which a box plot alone would hide.

* Helps identify if certain statuses (like no-shows) pose a higher revenue risk due to higher ADR.

##### 2. What is/are the insight(s) found from the chart?


**Answer -**

* Checked-out bookings have a fairly wide ADR spread and high volume near €100.

* Canceled bookings also cluster around €100, but have some extremely high-priced entries — that's lost revenue.

* No-shows often had slightly higher ADRs than average, adding to revenue loss risk.

##### 3. Will the gained insights help creating a positive business impact?


**Answer -**

Yes:

* Hotels can create cancellation buffers or prepayment policies for higher-priced bookings.

* Helps tailor reminder flows, pre-check-in confirmations, or penalty conditions.

* Enables revenue teams to better predict price-linked no-shows/cancellations.

##### 4. Are there any insights that lead to negative growth? Justify with specific reason.


**Answer -**

Yes:

* If high-priced bookings are often canceled or no-shows, this leads to big revenue losses and occupancy gaps.

* Indicates need for stronger prebooking security (credit card holds, advance payments).

* Suggests hotels might be over-relying on volatile booking sources.

Hotels should re-strategize around high-ADR cancellations and booking source validation.

#### Chart 29 - Special Requests vs Cancellation Rate

In [None]:
# Chart 29 - Visualisation Code

# Features used - total_of_special_requests, is_canceled

# Why this chart ?
# This chart will reveal:
#   * Are customers who place special requests more serious about showing up?
#   * Is there a pattern between guest involvement and cancellation risk?
# It’s an actionable guest behavior analysis.
# Smart hotels track this pattern to predict show-up probability.
# Helps in prioritizing bookings, assigning upgrades, and detecting low-commitment guests.
# It will show cancellation rate (%) across different values of total_of_special_requests.

# Calculating cancellation rate grouped by number of special requests
cancel_rate_special = df.groupby('total_of_special_requests')['is_canceled'].mean() * 100

# Resetting index for plotting
cancel_rate_special = cancel_rate_special.reset_index()

# Setting figure
plt.figure(figsize=(9, 6))
sns.set(style='whitegrid')

# Bar plot
sns.barplot(
    x='total_of_special_requests',
    y='is_canceled',
    data=cancel_rate_special,
    palette='viridis'
)

# Annotating percentage values on top of bars
for index, row in cancel_rate_special.iterrows():
    plt.text(index, row['is_canceled'] + 1, f"{row['is_canceled']:.1f}%", ha='center', fontsize=10)

# Labels and title
plt.title('Cancellation Rate by Number of Special Requests', fontsize=15, weight='bold')
plt.xlabel('Number of Special Requests', fontsize=12)
plt.ylabel('Cancellation Rate (%)', fontsize=12)
plt.grid(axis='y', linestyle='--', alpha=0.3)
plt.tight_layout()

# Showing the plot.
plt.show()

##### 1. Why did you pick the specific chart?

**Answer -**

* It's a behavioral indicator of how invested the guest is.

* This pattern helps in booking confidence scoring — a cutting-edge industry practice.

* It also covers a numerical-categorical pairing (cancellation rate by request count) in the bivariate analysis.

##### 2. What is/are the insight(s) found from the chart?


**Answer -**

* Guests with 0 special requests cancel far more frequently.

* As the number of requests increases, the cancellation rate drops sharply.

* Those with 2 or more requests rarely cancel — likely high-intent guests.

This suggests a link between guest engagement and commitment, although the chart alone cannot show that making requests causes guests to keep their bookings.

##### 3. Will the gained insights help creating a positive business impact?


**Answer -**


Definitely:

* Helps build a guest quality index — prioritize bookings with more special requests.

* Front-desk and ops teams can use this to offer fast-track check-ins or upgrades.

* Revenue teams can safely overbook low-request profiles, while securing high-request ones.

##### 4. Are there any insights that lead to negative growth? Justify with specific reason.


**Answer -**

Yes:

* Ignoring low-request, high-cancel profiles could lead to occupancy dips.

* Treating all bookings equally (without segmenting by commitment) may waste upgrade opportunities.

Hotels can build smarter booking pipelines by combining request count with lead time and segment.

#### Chart 30 - ADR by Number of Booking Changes

In [None]:
# Chart 30 - Visualisation Code

# Features used - adr, booking_changes

# Why this chart ?
# This chart will explore:
#     * Do guests who modify their bookings tend to pay more (or less)?
#     * Is booking flexibility correlated with revenue?

# Removing extreme outliers for a cleaner view
df_box = df[df['adr'] < 500]

# Plotting
plt.figure(figsize=(9, 6))
sns.set(style='whitegrid')

# Box plot of ADR by number of booking changes
sns.boxplot(
    data=df_box,
    x='booking_changes',
    y='adr',
    palette='BuPu'
)

# Adding labels and titles
plt.title('ADR by Number of Booking Changes', fontsize=15, weight='bold')
plt.xlabel('Number of Booking Changes', fontsize=12)
plt.ylabel('ADR (EUR)', fontsize=12)
plt.grid(axis='y', linestyle='--', alpha=0.3)
plt.tight_layout()

plt.show()


##### 1. Why did you pick the specific chart?

**Answer -**

* Box plots are great for analyzing how a numerical variable (adr) changes across a categorical variable with ordered values (booking_changes).

* This chart uncovers patterns of guest behavior vs pricing — a key insight for hotel pricing strategy.

* It helps detect if high-value guests are also more likely to change bookings (e.g., premium guests modifying plans).



##### 2. What is/are the insight(s) found from the chart?


**Answer -**

* The majority of guests don't make any changes (0) — this group shows the widest ADR range and lowest median.

* Guests who make 1 or 2 changes tend to have higher median ADRs, suggesting they may be:

    * More engaged with the booking process

    * Possibly premium customers who adjust for comfort or schedule

* Guests with many changes (3+) are rare, and their ADRs are lower, possibly because these bookings come from lower-priced segments. Because these groups are small, their medians are less reliable.
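The medians read off the box plot can be tabulated directly. The sketch below reuses the same `adr < 500` filter as the chart and groups 3 or more changes into one bucket; the helper name and the toy frame are illustrative assumptions, not part of the project code.

```python
import pandas as pd

def adr_by_change_group(bookings: pd.DataFrame, adr_cap: float = 500) -> pd.DataFrame:
    """Booking count and median ADR per booking-change group (0, 1, 2, 3+)."""
    trimmed = bookings[bookings["adr"] < adr_cap]
    change_group = (
        trimmed["booking_changes"]
        .clip(upper=3)                         # fold 3, 4, 5, ... into one bucket
        .map({0: "0", 1: "1", 2: "2", 3: "3+"})
        .rename("change_group")
    )
    return trimmed.groupby(change_group)["adr"].agg(bookings="size", median_adr="median")

# Toy data for illustration only (not the project dataset)
toy_changes = pd.DataFrame({
    "booking_changes": [0, 0, 0, 1, 1, 2, 3, 5, 0],
    "adr":             [80, 100, 120, 130, 150, 140, 90, 70, 900],
})
adr_summary = adr_by_change_group(toy_changes)
print(adr_summary)
```

Showing the `bookings` count next to each median makes it easier to see which groups are too small to support a firm conclusion.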

##### 3. Will the gained insights help creating a positive business impact?


**Answer -**

Absolutely:

* Reveals a valuable customer segment: guests who make 1–2 changes are likely high-paying and involved.

* Revenue and CRM teams can create personalized experiences or offer value-added options (like flexible check-in/out or upgrades) for this segment.

* Helps operations plan better — guests who change bookings often could be flagged for extra service attention.

##### 4. Are there any insights that lead to negative growth? Justify with specific reason.


**Answer -**

Yes:

* Guests with too many booking changes (3+) could indicate indecisiveness, potential for cancellation, or planning uncertainty.

* Frequent changes may create scheduling conflicts, inventory imbalance, and manual workload for staff.

* If these frequent changers are also low-value, hotels may need to limit excessive modifications through policy (e.g., max free changes).
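Whether frequent changers really are lower-value or more likely to cancel can be tested before setting any policy. The following is a minimal sketch under the assumption that a threshold of 3 changes defines a "frequent changer"; the helper name and the toy data are hypothetical.

```python
import pandas as pd

def frequent_changer_profile(bookings: pd.DataFrame, min_changes: int = 3) -> pd.DataFrame:
    """Compare cancellation rate and median ADR for frequent changers vs. others."""
    is_frequent = bookings["booking_changes"] >= min_changes
    change_group = is_frequent.map(
        {True: f"{min_changes}+ changes", False: f"<{min_changes} changes"}
    ).rename("change_group")
    profile = bookings.groupby(change_group).agg(
        bookings=("is_canceled", "size"),
        cancel_rate_pct=("is_canceled", "mean"),
        median_adr=("adr", "median"),
    )
    profile["cancel_rate_pct"] = (profile["cancel_rate_pct"] * 100).round(1)
    return profile

# Toy data for illustration only (not the project dataset)
toy_profile_input = pd.DataFrame({
    "booking_changes": [0, 1, 0, 3, 4, 2],
    "is_canceled":     [1, 0, 0, 0, 1, 0],
    "adr":             [100, 120, 80, 60, 70, 110],
})
profile = frequent_changer_profile(toy_profile_input)
print(profile)
```

If the real data shows that frequent changers cancel less often than other guests, a cap on free changes may not be warranted, so the comparison is worth running on `df` first.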

#### Chart 31 - Cancellation Rate by Deposit Type

In [None]:
# Chart 31 - Visualisation Code

# Features used - deposit_type, is_canceled

# Why this chart ?
# Deposit policies directly impact booking seriousness and financial security for the hotel.
# This stacked percentage bar plot clearly shows how cancellation rates
# differ across deposit types, making it a goldmine for revenue managers.
# It uncovers which payment models protect against cancellations,
# allowing the business to strategize inventory and pricing better.

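# Creating a grouped dataframe of booking counts per deposit type and status
# (fill_value=0 avoids NaN when a deposit type has no bookings in one status)
deposit_cancel = df.groupby(['deposit_type', 'is_canceled']).size().unstack(fill_value=0)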

# Converting counts to percentage
deposit_cancel_percent = deposit_cancel.div(deposit_cancel.sum(axis=1), axis=0) * 100

# Plotting the stacked bar chart
deposit_cancel_percent.plot(
    kind='bar',
    stacked=True,
    figsize=(8, 6),
    color=['mediumseagreen', 'indianred']
)

# Chart styling
plt.title('Cancellation Rate by Deposit Type', fontsize=15, weight='bold')
plt.xlabel('Deposit Type', fontsize=12)
plt.ylabel('Percentage (%)', fontsize=12)
plt.legend(['Not Canceled', 'Canceled'], title='Booking Status', loc='upper right')
plt.grid(axis='y', linestyle='--', alpha=0.3)
plt.xticks(rotation=0)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

**Answer -**

* Deposit policies directly impact booking seriousness and financial security for the hotel.

* This stacked percentage bar plot clearly shows how cancellation rates differ across deposit types, which is directly useful for revenue managers.

* It uncovers which payment models protect against cancellations, allowing the business to strategize inventory and pricing better.


##### 2. What is/are the insight(s) found from the chart?


**Answer -**

* No Deposit bookings have a very high cancellation rate, around 40-50%.

* Non Refundable bookings are almost never canceled, presumably because guests have prepaid and therefore follow through.

* Refundable deposit bookings are a small share, but still show relatively lower cancellations compared to “No Deposit”.
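Percentages estimated by eye from a stacked bar can differ from the exact figures, so it may help to print the shares behind the chart. The sketch below uses `pd.crosstab` with row normalization; the category labels in the toy frame (`No Deposit`, `Non Refund`, `Refundable`) and the helper name are assumptions, and the toy values are arbitrary rather than taken from the dataset.

```python
import pandas as pd

def cancel_share_by_deposit(bookings: pd.DataFrame) -> pd.DataFrame:
    """Share (%) of kept and cancelled bookings for each deposit type."""
    share = pd.crosstab(
        bookings["deposit_type"],
        bookings["is_canceled"],
        normalize="index",   # each row sums to 1 before scaling to percent
    ) * 100
    return share.rename(columns={0: "not_canceled_pct", 1: "canceled_pct"}).round(1)

# Toy data for illustration only (arbitrary values, not the project dataset)
toy_deposits = pd.DataFrame({
    "deposit_type": ["No Deposit"] * 4 + ["Non Refund"] * 2 + ["Refundable"] * 2,
    "is_canceled":  [1, 0, 0, 0, 1, 0, 0, 0],
})
deposit_share = cancel_share_by_deposit(toy_deposits)
print(deposit_share)
```

Running `cancel_share_by_deposit(df)` alongside `df['deposit_type'].value_counts()` gives both the exact rates and the group sizes, so the insights listed above can be confirmed or corrected against the numbers.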

##### 3. Will the gained insights help creating a positive business impact?


**Answer -**

Yes:

* Encourages shifting toward prepaid (non-refundable) and semi-flex deposit types to secure revenue.

* Hotels can incentivize non-refundable bookings with discounts, and limit exposure to speculative “no-deposit” bookings.

* Sales teams can use this insight to negotiate OTA contracts and tweak cancellation policies.

##### 4. Are there any insights that lead to negative growth? Justify with specific reason.


**Answer -**

Yes:

* Over-reliance on No Deposit bookings increases cancellation volume, leading to:

    * Lost revenue

    * Last-minute room inventory wastage

    * Staff inefficiency

* If hotels don't offer prepaid incentives, they risk losing committed guests to competitors who do.

A good strategy would be:

  * Offer better pricing on non-refundable options

  * Push prepaid plans for peak seasons or high-demand weekends

  * Balance flexibility with deposit-linked trust metrics