# Hotel Booking Cancellation - Exploratory Data Analysis

This notebook performs comprehensive exploratory data analysis on the hotel booking dataset to understand patterns, relationships, and factors that influence cancellations.

## Objectives
1. Load and inspect the raw dataset
2. Generate summary statistics for numerical features
3. Visualize distributions of key features
4. Analyze correlations with the target variable
5. Identify class imbalance in cancellations
6. Detect and visualize outliers
7. Document key insights and patterns

## 1. Setup and Data Loading

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import sys
import os

# Add parent directory to path for imports
sys.path.append('..')

from src.data_processing.data_loader import DataLoader
from src.utils.logger import get_logger

# Configure visualization settings
warnings.filterwarnings('ignore')
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')
%matplotlib inline

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.float_format', '{:.2f}'.format)

# Initialize logger
logger = get_logger(__name__)

print("Libraries imported successfully!")

In [None]:
# Load configuration
import yaml

with open('../config/config.yaml', 'r') as f:
    config = yaml.safe_load(f)

# Get data path
data_path = config['data']['raw_data_path']
print(f"Data path: {data_path}")

In [None]:
# Load the dataset
loader = DataLoader()
df = loader.load_csv(data_path)

print("Dataset loaded successfully!")
print(f"Shape: {df.shape}")

## 2. Basic Dataset Information

In [None]:
# Display basic information
print("=== Dataset Shape ===")
print(f"Rows: {df.shape[0]:,}")
print(f"Columns: {df.shape[1]}")
print("\n=== Data Types ===")
print(df.dtypes)
print("\n=== First Few Rows ===")
df.head()

In [None]:
# Display dataset info
print("=== Dataset Info ===")
df.info()

In [None]:
# Check for missing values
print("=== Missing Values ===")
missing = df.isnull().sum()
missing_pct = (missing / len(df)) * 100
missing_df = pd.DataFrame({
    'Missing Count': missing,
    'Percentage': missing_pct
})
missing_df = missing_df[missing_df['Missing Count'] > 0].sort_values('Missing Count', ascending=False)
# Print the table only when at least one column has missing values
if missing_df.empty:
    print("No missing values found!")
else:
    print(missing_df)

## 3. Summary Statistics for Numerical Features

In [None]:
# Generate summary statistics for numerical columns
print("=== Numerical Features Summary Statistics ===")
numerical_cols = df.select_dtypes(include=[np.number]).columns.tolist()
print(f"Number of numerical features: {len(numerical_cols)}")
print(f"Numerical columns: {numerical_cols}\n")

df[numerical_cols].describe().T

## 4. Target Variable Analysis - Class Imbalance

In [None]:
# Analyze target variable distribution
print("=== Target Variable: is_canceled ===")
cancellation_counts = df['is_canceled'].value_counts()
cancellation_pct = df['is_canceled'].value_counts(normalize=True) * 100

print("\nCancellation Distribution:")
print(f"Not Canceled (0): {cancellation_counts[0]:,} ({cancellation_pct[0]:.2f}%)")
print(f"Canceled (1): {cancellation_counts[1]:,} ({cancellation_pct[1]:.2f}%)")

# Calculate imbalance ratio
imbalance_ratio = cancellation_counts.max() / cancellation_counts.min()
print(f"\nImbalance Ratio: {imbalance_ratio:.2f}:1")

if imbalance_ratio > 1.5:
    print("⚠️ Class imbalance detected! Consider using SMOTE or class weights during training.")
else:
    print("✓ Classes are relatively balanced.")

In [None]:
# Visualize target variable distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Count plot
sns.countplot(data=df, x='is_canceled', ax=axes[0])
axes[0].set_title('Cancellation Distribution (Count)', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Is Canceled')
axes[0].set_ylabel('Count')
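# Fix the tick positions before relabeling to avoid a matplotlib warning
axes[0].set_xticks([0, 1])
axes[0].set_xticklabels(['Not Canceled', 'Canceled'])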

# Add value labels on bars
for container in axes[0].containers:
    axes[0].bar_label(container)

# Pie chart
colors = ['#2ecc71', '#e74c3c']
# value_counts() orders by frequency; sort by class label so slices match the labels
axes[1].pie(cancellation_counts.sort_index(), labels=['Not Canceled', 'Canceled'], autopct='%1.1f%%',
            startangle=90, colors=colors)
axes[1].set_title('Cancellation Distribution (Percentage)', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.show()

## 5. Distribution of Numerical Features

In [None]:
# Select key numerical features for visualization
key_numerical_features = [
    'lead_time', 'adr', 'stays_in_weekend_nights', 'stays_in_week_nights',
    'adults', 'children', 'babies', 'previous_cancellations',
    'previous_bookings_not_canceled', 'booking_changes', 
    'days_in_waiting_list', 'required_car_parking_spaces', 
    'total_of_special_requests'
]

# Filter to only include columns that exist in the dataset
key_numerical_features = [col for col in key_numerical_features if col in df.columns]

print(f"Visualizing {len(key_numerical_features)} key numerical features")
print(key_numerical_features)

In [None]:
# Create histograms for numerical features
n_features = len(key_numerical_features)
n_cols = 3
n_rows = (n_features + n_cols - 1) // n_cols

fig, axes = plt.subplots(n_rows, n_cols, figsize=(18, n_rows * 4))
axes = axes.flatten()

for idx, col in enumerate(key_numerical_features):
    axes[idx].hist(df[col].dropna(), bins=50, edgecolor='black', alpha=0.7)
    axes[idx].set_title(f'Distribution of {col}', fontsize=12, fontweight='bold')
    axes[idx].set_xlabel(col)
    axes[idx].set_ylabel('Frequency')
    axes[idx].grid(True, alpha=0.3)

# Hide unused subplots
for idx in range(n_features, len(axes)):
    axes[idx].axis('off')

plt.tight_layout()
plt.show()

## 6. Distribution of Categorical Features

In [None]:
# Identify categorical columns
categorical_cols = df.select_dtypes(include=['object']).columns.tolist()
print(f"Number of categorical features: {len(categorical_cols)}")
print(f"Categorical columns: {categorical_cols}\n")

# Display unique value counts for each categorical column
print("=== Unique Values in Categorical Features ===")
for col in categorical_cols:
    n_unique = df[col].nunique()
    print(f"{col}: {n_unique} unique values")

In [None]:
# Select key categorical features for visualization
key_categorical_features = [
    'hotel', 'meal', 'market_segment', 'distribution_channel',
    'deposit_type', 'customer_type', 'reserved_room_type'
]

# Filter to only include columns that exist
key_categorical_features = [col for col in key_categorical_features if col in df.columns]

# Create count plots for categorical features
n_features = len(key_categorical_features)
n_cols = 2
n_rows = (n_features + n_cols - 1) // n_cols

fig, axes = plt.subplots(n_rows, n_cols, figsize=(16, n_rows * 4))
axes = axes.flatten()

for idx, col in enumerate(key_categorical_features):
    # Get top categories if too many
    value_counts = df[col].value_counts()
    if len(value_counts) > 10:
        top_values = value_counts.head(10).index
        plot_data = df[df[col].isin(top_values)]
        title_suffix = ' (Top 10)'
    else:
        plot_data = df
        title_suffix = ''
    
    sns.countplot(data=plot_data, y=col, ax=axes[idx], order=plot_data[col].value_counts().index)
    axes[idx].set_title(f'Distribution of {col}{title_suffix}', fontsize=12, fontweight='bold')
    axes[idx].set_xlabel('Count')
    axes[idx].set_ylabel(col)

# Hide unused subplots
for idx in range(n_features, len(axes)):
    axes[idx].axis('off')

plt.tight_layout()
plt.show()

## 7. Correlation Analysis with Target Variable

In [None]:
# Calculate correlation matrix for numerical features
correlation_matrix = df[numerical_cols].corr()

# Get correlations with target variable
target_correlations = correlation_matrix['is_canceled'].sort_values(ascending=False)
print("=== Correlation with Target Variable (is_canceled) ===")
print(target_correlations)

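# Drop the target's self-correlation, then rank each sign separately
feature_correlations = target_correlations.drop('is_canceled')

print("\n=== Top 10 Positive Correlations ===")
print(feature_correlations[feature_correlations > 0].head(10))

print("\n=== Top 10 Negative Correlations ===")
print(feature_correlations[feature_correlations < 0].sort_values().head(10))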

In [None]:
# Visualize correlation heatmap
plt.figure(figsize=(16, 14))
sns.heatmap(correlation_matrix, annot=True, fmt='.2f', cmap='coolwarm', 
            center=0, square=True, linewidths=0.5, cbar_kws={"shrink": 0.8})
plt.title('Correlation Heatmap of Numerical Features', fontsize=16, fontweight='bold', pad=20)
plt.tight_layout()
plt.show()

In [None]:
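# Visualize top correlations with target variable
# Drop the target itself, then split by sign so no feature is plotted twice
# (slicing the first and last 10 overlaps when there are fewer than 21 numerical columns)
feature_correlations = target_correlations.drop('is_canceled')
top_positive = feature_correlations[feature_correlations > 0].head(10)
top_negative = feature_correlations[feature_correlations < 0].tail(10)
top_features = pd.concat([top_positive, top_negative]).sort_values()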

plt.figure(figsize=(10, 8))
colors = ['red' if x < 0 else 'green' for x in top_features.values]
plt.barh(range(len(top_features)), top_features.values, color=colors, alpha=0.7)
plt.yticks(range(len(top_features)), top_features.index)
plt.xlabel('Correlation Coefficient', fontsize=12)
plt.title('Top Features Correlated with Cancellation', fontsize=14, fontweight='bold')
plt.axvline(x=0, color='black', linestyle='--', linewidth=0.8)
plt.grid(True, alpha=0.3, axis='x')
plt.tight_layout()
plt.show()

## 8. Outlier Detection and Visualization

In [None]:
# Function to detect outliers using IQR method
def detect_outliers_iqr(data, column):
    """Detect outliers using the IQR method."""
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    
    outliers = data[(data[column] < lower_bound) | (data[column] > upper_bound)]
    return outliers, lower_bound, upper_bound

# Detect outliers for key numerical features
print("=== Outlier Detection (IQR Method) ===\n")

outlier_summary = []
for col in key_numerical_features:
    outliers, lower, upper = detect_outliers_iqr(df, col)
    outlier_pct = (len(outliers) / len(df)) * 100
    outlier_summary.append({
        'Feature': col,
        'Outlier Count': len(outliers),
        'Outlier %': outlier_pct,
        'Lower Bound': lower,
        'Upper Bound': upper
    })

outlier_df = pd.DataFrame(outlier_summary).sort_values('Outlier Count', ascending=False)
print(outlier_df.to_string(index=False))

In [None]:
# Create boxplots to visualize outliers
n_features = len(key_numerical_features)
n_cols = 3
n_rows = (n_features + n_cols - 1) // n_cols

fig, axes = plt.subplots(n_rows, n_cols, figsize=(18, n_rows * 4))
axes = axes.flatten()

for idx, col in enumerate(key_numerical_features):
    sns.boxplot(data=df, y=col, ax=axes[idx], color='skyblue')
    axes[idx].set_title(f'Boxplot of {col}', fontsize=12, fontweight='bold')
    axes[idx].set_ylabel(col)
    axes[idx].grid(True, alpha=0.3, axis='y')

# Hide unused subplots
for idx in range(n_features, len(axes)):
    axes[idx].axis('off')

plt.tight_layout()
plt.show()

## 9. Feature Relationships with Cancellation

In [None]:
# Analyze cancellation rate by categorical features
print("=== Cancellation Rate by Categorical Features ===\n")

for col in key_categorical_features[:5]:  # Analyze first 5 categorical features
    print(f"\n--- {col} ---")
    cancellation_by_cat = df.groupby(col)['is_canceled'].agg(['mean', 'count'])
    cancellation_by_cat.columns = ['Cancellation Rate', 'Count']
    cancellation_by_cat['Cancellation Rate'] = cancellation_by_cat['Cancellation Rate'] * 100
    cancellation_by_cat = cancellation_by_cat.sort_values('Cancellation Rate', ascending=False)
    print(cancellation_by_cat.head(10))

In [None]:
# Visualize cancellation rate by key categorical features
fig, axes = plt.subplots(2, 2, figsize=(16, 12))
axes = axes.flatten()

features_to_plot = ['hotel', 'deposit_type', 'customer_type', 'market_segment']
features_to_plot = [f for f in features_to_plot if f in df.columns]

for idx, col in enumerate(features_to_plot[:4]):
    cancellation_rate = df.groupby(col)['is_canceled'].mean().sort_values(ascending=False)
    
    cancellation_rate.plot(kind='bar', ax=axes[idx], color='coral', alpha=0.7)
    axes[idx].set_title(f'Cancellation Rate by {col}', fontsize=12, fontweight='bold')
    axes[idx].set_xlabel(col)
    axes[idx].set_ylabel('Cancellation Rate')
    axes[idx].set_ylim(0, 1)
    axes[idx].grid(True, alpha=0.3, axis='y')
    axes[idx].tick_params(axis='x', rotation=45)

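# Hide unused subplots if fewer than four of the features exist in the dataset
for idx in range(len(features_to_plot), len(axes)):
    axes[idx].axis('off')

plt.tight_layout()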
plt.show()

In [None]:
# Analyze numerical features by cancellation status
print("=== Numerical Features by Cancellation Status ===\n")

comparison_features = ['lead_time', 'adr', 'total_of_special_requests', 'previous_cancellations']
comparison_features = [f for f in comparison_features if f in df.columns]

for col in comparison_features:
    print(f"\n--- {col} ---")
    # describe() has no 'median' column (it reports '50%'), so aggregate explicitly
    print(df.groupby('is_canceled')[col].agg(['mean', 'median', 'std']))

## 10. Key Insights and Patterns

### Summary of Key Findings

Based on the exploratory data analysis, here are the key insights discovered:

#### 1. **Dataset Overview**
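- The dataset contains hotel booking records; Section 2 reports the exact row and column counts
- Both numerical and categorical features are present (listed in Sections 3 and 6)
- Columns with missing values appear in the missing-value table in Section 2 and need to be handled before modeling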

#### 2. **Target Variable (Cancellation)**
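- Class imbalance may exist between canceled and non-canceled bookings
- Section 4 flags imbalance when the majority-to-minority ratio exceeds 1.5; in that case SMOTE or class weights are worth considering
- The baseline cancellation rate is the reference point for model evaluation, since a model that always predicts the majority class already reaches that accuracy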
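
As a rough illustration of the class-weight option mentioned above, the sketch below computes "balanced" weights as `n_samples / (n_classes * class_count)`. The helper name `balanced_class_weights` and the toy target are hypothetical and not part of this project; the sketch also assumes the target has no missing values.

```python
import pandas as pd


def balanced_class_weights(target: pd.Series) -> dict:
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = target.value_counts()
    n_samples = len(target)
    n_classes = len(counts)
    return {cls: float(n_samples / (n_classes * count)) for cls, count in counts.items()}


# Toy target for illustration only: 6 kept bookings and 2 cancellations
toy_target = pd.Series([0, 0, 0, 0, 0, 0, 1, 1], name="is_canceled")
weights = balanced_class_weights(toy_target)
print(weights)
```

The minority class receives the larger weight, so errors on canceled bookings cost more during training. Many estimators accept such a mapping through a class-weight style parameter; check the chosen library's documentation for the exact name.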

#### 3. **Feature Correlations**
- Several features show strong correlation with cancellation probability
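- Lead time and previous cancellations are likely important numerical predictors; deposit type is categorical, so its effect shows up in the per-category cancellation rates (Section 9) rather than in the Pearson correlation matrix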
- Some features may be highly correlated with each other (multicollinearity)
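
To make the multicollinearity point concrete, here is a minimal sketch that lists feature pairs whose absolute Pearson correlation reaches a threshold. The helper `highly_correlated_pairs`, the 0.8 cutoff, and the toy values (including the `total_nights` column) are assumptions chosen for illustration, not results from this dataset.

```python
from itertools import combinations

import pandas as pd


def highly_correlated_pairs(frame: pd.DataFrame, threshold: float = 0.8) -> pd.DataFrame:
    """Return feature pairs with |Pearson correlation| >= threshold."""
    corr = frame.corr()
    rows = []
    # Visit each unordered pair once, so (a, b) and (b, a) are not both reported
    for a, b in combinations(corr.columns, 2):
        value = corr.loc[a, b]
        if abs(value) >= threshold:
            rows.append({"feature_a": a, "feature_b": b, "correlation": value})
    result = pd.DataFrame(rows, columns=["feature_a", "feature_b", "correlation"])
    return result.sort_values("correlation", ascending=False).reset_index(drop=True)


# Toy numbers for illustration only
toy = pd.DataFrame({
    "stays_in_week_nights": [1, 2, 3, 4, 5],
    "total_nights": [2, 3, 5, 6, 8],
    "adr": [100, 80, 120, 90, 95],
})
pairs = highly_correlated_pairs(toy)
print(pairs)
```

In the notebook this could be called on `df[numerical_cols]` after dropping `is_canceled`. Pairs it reports are candidates for dropping or combining one of the two features.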

#### 4. **Outliers**
- Outliers are present in several numerical features
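- Features like lead_time and adr may have extreme values
- For count features where most values are zero, the quartiles coincide (IQR = 0) and the IQR rule flags every non-zero value, so those outlier counts should be read with care
- Outlier treatment strategy should be decided based on domain knowledge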
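
One possible treatment, shown here only as a sketch, is to cap values at the same IQR bounds used in Section 8 instead of dropping rows. Whether capping suits a given feature is a domain decision. The helper name `cap_outliers_iqr` and the toy `adr` values are hypothetical.

```python
import pandas as pd


def cap_outliers_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values to [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)


# Toy values for illustration only: one extreme daily rate
toy_adr = pd.Series([80.0, 90.0, 100.0, 110.0, 120.0, 5000.0], name="adr")
capped = cap_outliers_iqr(toy_adr)
print(capped.tolist())
```

Capping keeps every row, which matters when the extreme bookings are real and not data-entry errors.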

#### 5. **Feature Engineering Opportunities**
- Creating derived features like total_guests and total_nights could be beneficial
- Temporal features (month, season) may capture booking patterns
- Interaction features between key predictors could improve model performance
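
The derived features named above could be built roughly as follows. This is a sketch on a toy frame, not the project's feature-engineering code. The helper `add_derived_features` is hypothetical, and filling missing `children` values with 0 is an assumption that should be checked against the data.

```python
import pandas as pd


def add_derived_features(frame: pd.DataFrame) -> pd.DataFrame:
    """Add total_guests and total_nights without modifying the input frame."""
    out = frame.copy()
    # Assumption: a missing child count means no children on the booking
    out["total_guests"] = out["adults"] + out["children"].fillna(0) + out["babies"]
    out["total_nights"] = out["stays_in_weekend_nights"] + out["stays_in_week_nights"]
    return out


# Toy bookings for illustration only
toy = pd.DataFrame({
    "adults": [2, 2, 1],
    "children": [0.0, None, 2.0],
    "babies": [0, 1, 0],
    "stays_in_weekend_nights": [0, 2, 1],
    "stays_in_week_nights": [2, 5, 3],
})
enriched = add_derived_features(toy)
print(enriched[["total_guests", "total_nights"]])
```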

#### 6. **Data Quality**
- Missing values need to be imputed or removed
- Categorical features need encoding (label or one-hot)
- Numerical features may benefit from scaling/normalization
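
The three steps above can be sketched with pandas alone: median imputation, standardization to zero mean and unit variance, and one-hot encoding. The helper `prepare_features` and the toy values are illustrative assumptions; a real pipeline would likely use a dedicated preprocessing library.

```python
import pandas as pd


def prepare_features(frame: pd.DataFrame, numeric_cols: list, categorical_cols: list) -> pd.DataFrame:
    """Impute and standardize numeric columns, then one-hot encode categorical ones."""
    out = frame.copy()
    for col in numeric_cols:
        # Fill gaps with the column median, which is robust to extreme values
        out[col] = out[col].fillna(out[col].median())
        std = out[col].std()
        if std > 0:  # Leave constant columns unscaled to avoid dividing by zero
            out[col] = (out[col] - out[col].mean()) / std
    return pd.get_dummies(out, columns=categorical_cols)


# Toy rows for illustration only
toy = pd.DataFrame({
    "lead_time": [10.0, None, 30.0],
    "deposit_type": ["No Deposit", "Non Refund", "No Deposit"],
})
prepared = prepare_features(toy, ["lead_time"], ["deposit_type"])
print(prepared)
```

To avoid leakage, the medians, means, and standard deviations should be computed on the training split only and then reused on validation and test data.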

#### 7. **Next Steps**
- Implement data cleaning to handle missing values and duplicates
- Create engineered features based on insights
- Apply appropriate encoding and scaling transformations
- Address class imbalance if present
- Train multiple models and compare performance

In [None]:
# Generate final summary statistics
print("=== EDA Summary ===")
print(f"Total Records: {len(df):,}")
print(f"Total Features: {len(df.columns)}")
print(f"Numerical Features: {len(numerical_cols)}")
print(f"Categorical Features: {len(categorical_cols)}")
print(f"Cancellation Rate: {df['is_canceled'].mean():.2%}")
print(f"Missing Values: {df.isnull().sum().sum():,}")
print("\n✓ Exploratory Data Analysis Complete!")