# 🔎 Data Exploration: Hotel Booking Dataset

This notebook provides an initial exploration of the hotel booking dataset, focusing on understanding its structure, key statistics, and inter-variable relationships. The goal is to prepare the data for later modeling and visualization.


## Dataset Overview

In [None]:
# Importing libraries

import matplotlib.pyplot as plt
import pandas as pd
import os
import seaborn as sns

In [2]:
# Importing dataset

hotel_dataset = pd.read_csv('../data/hotel_bookings.csv')

In [3]:
hotel_dataset.shape

(36275, 19)

The dataset includes around 36,000 entries with 19 attributes capturing key details about customer reservations, booking behavior, and hotel operations.
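Beyond `shape`, a quick look at column types and missing-value counts helps describe the structure of the data. The sketch below is only an illustration: `toy_bookings` is a hypothetical stand-in with made-up values, and the `segment` column name is an assumption, not necessarily a column of the real dataset. In the notebook, the same summary could be built from `hotel_dataset` instead.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for hotel_dataset; all values are made up.
toy_bookings = pd.DataFrame({
    "no_of_adults": [2, 2, 1, np.nan],
    "lead_time": [17, 57, np.nan, 126],
    "segment": ["Online", "Offline", "Online", None],  # assumed column name
})

# One row per column: its dtype, the number of missing values,
# and the percentage of rows that are missing.
structure_summary = pd.DataFrame({
    "dtype": toy_bookings.dtypes.astype(str),
    "n_missing": toy_bookings.isna().sum(),
    "pct_missing": (toy_bookings.isna().mean() * 100).round(1),
})

print(structure_summary)
```

Putting the three measures in one frame keeps the overview sortable, for example by `pct_missing`, when the number of columns grows.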

In [6]:
# Data summary

hotel_dataset.describe()

statistic,no_of_adults,no_of_children,no_of_weekend_nights,no_of_week_nights,required_car_parking_space,lead_time,arrival_year,arrival_month,arrival_date,repeated_guest,no_of_previous_cancellations,no_of_previous_bookings_not_canceled,avg_price_per_room,no_of_special_requests
count,35862.0,35951.0,35908.0,35468.0,33683.0,35803.0,35897.0,35771.0,35294.0,35689.0,35778.0,35725.0,35815.0,35486.0
mean,1.845017,0.105366,0.810209,2.20331,0.030698,85.276569,2017.820431,7.424031,15.605712,0.025666,0.023646,0.154458,103.418207,0.619343
std,0.518652,0.402871,0.870857,1.40989,0.172501,85.998845,0.383834,3.068277,8.743484,0.15814,0.370835,1.764805,35.057342,0.785849
min,0.0,0.0,0.0,0.0,0.0,0.0,2017.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0
25%,2.0,0.0,0.0,1.0,0.0,17.0,2018.0,5.0,8.0,0.0,0.0,0.0,80.3,0.0
50%,2.0,0.0,1.0,2.0,0.0,57.0,2018.0,8.0,16.0,0.0,0.0,0.0,99.45,0.0
75%,2.0,0.0,2.0,3.0,0.0,126.0,2018.0,10.0,23.0,0.0,0.0,0.0,120.0,1.0
max,4.0,10.0,7.0,17.0,1.0,443.0,2018.0,12.0,31.0,1.0,13.0,58.0,540.0,5.0


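The median booking has 2 adults and no children (the mean number of children is about 0.1). A typical stay covers 1 weekend night and 2 week nights, with means of about 0.8 and 2.2 respectively. The median lead time is 57 days, while the mean is about 85 days and the maximum is 443 days, so lead times are right-skewed. The median arrival month is August (month 8), which is consistent with a busy summer season, although the monthly distribution would need to be inspected to confirm a peak. The average room price is approximately $103 per night, with a median of $99.45. The `count` row is below 36,275 for every numeric column, which indicates missing values that will need handling before modeling.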
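Since the summary relies on both means and medians, it can help to put the two side by side and see where they diverge. The following is a minimal sketch on hypothetical values (`toy_bookings` is not the real data); only the column names `lead_time` and `avg_price_per_room` come from the dataset.

```python
import pandas as pd

# Hypothetical values, chosen so that lead_time has one large outlier.
toy_bookings = pd.DataFrame({
    "lead_time": [0, 10, 20, 60, 410],
    "avg_price_per_room": [80.0, 95.0, 100.0, 110.0, 115.0],
})

# Rows are the original columns; the result has one column per statistic.
center = toy_bookings.agg(["mean", "median"]).T

# A large positive gap suggests a right-skewed distribution.
center["mean_minus_median"] = center["mean"] - center["median"]

print(center)
```

A gap close to zero, as for the toy price column here, suggests a roughly symmetric distribution, in which case either statistic is a fair summary.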

## Correlation Analysis

In [None]:
# Select only numeric columns
bookings_numeric_data = hotel_dataset.select_dtypes(include='number')

# Compute correlation matrix
correlation_table = bookings_numeric_data.corr()

# Plot
fig, ax = plt.subplots(figsize=(12, 10))  # Larger figure size for clarity

sns.heatmap(correlation_table, 
            annot=True, 
            fmt=".2f", 
            cmap="Blues", 
            cbar=True, 
            linewidths=0.5, 
            linecolor='white', 
            ax=ax)

# Titles and styling
ax.set_title('Correlation Matrix', fontsize=16)
ax.tick_params(axis='x', rotation=90)
ax.tick_params(axis='y', rotation=0)

ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

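# Create the output directory if it does not exist yet, then save the figure
os.makedirs("../plots", exist_ok=True)
plt.savefig("../plots/correlation_matrix.png", dpi=300, bbox_inches='tight')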
plt.show()


A correlation analysis was performed on the numeric variables to assess potential linear relationships.

The strongest correlations appear among repeated guest status, number of previous cancellations, and number of previous bookings not canceled. However, no correlation coefficient exceeds 0.5, so these associations are at most moderate and the variables are not redundant with one another.

Because no pair of numeric features is strongly correlated, all numeric variables are retained for subsequent modeling and analysis. Pairwise correlations cannot rule out multicollinearity involving three or more variables, but there is no evidence of it at this stage.
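Reading a threshold off a heatmap is easy to get wrong, so the 0.5 check can also be done programmatically. The sketch below is one possible way to do it, not the method used in this notebook: `correlated_pairs` is a hypothetical helper, the toy columns `x`, `y`, and `z` are made up, and comparing absolute values (so that strong negative correlations are also caught) is a design choice of this example.

```python
import numpy as np
import pandas as pd


def correlated_pairs(frame, threshold=0.5):
    """Return variable pairs whose absolute Pearson correlation exceeds threshold."""
    corr = frame.select_dtypes(include="number").corr()
    # Keep only the strict upper triangle so each pair appears once
    # and the diagonal (self-correlation of 1.0) is excluded.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    # NaN entries (masked cells) compare as False and are dropped here.
    pairs = pairs[pairs.abs() > threshold]
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False)


# Hypothetical toy data: y is an exact multiple of x, z is unrelated to both.
toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 6, 8, 10],
    "z": [1, -1, 1, -1, 1],
})

strong = correlated_pairs(toy, threshold=0.5)
print(strong)
```

In the notebook, calling the helper on `hotel_dataset` would return an empty Series if no pair exceeds the threshold, which is what the heatmap reading above implies.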