# 🏨 Hotel Cancellation Prediction Project

## 🎯 Learning Objectives
Welcome to your neural network journey! This project will teach you:

### Core Concepts You'll Master
- **Binary Classification**: Predicting Yes/No (Will cancel? Yes/No)
- **Multiclass Classification**: Predicting categories (Check-out/Canceled/No-show)
- **PyTorch Fundamentals**: Building and training neural networks
- **Real-world ML**: Using actual hotel booking data

## 📚 Neural Network Fundamentals

### Architecture Components
- **Layers**: Input → Hidden → Output layers
- **Activation Functions**: ReLU, Sigmoid, Softmax
- **Loss Functions**: Binary Cross-Entropy, Cross-Entropy
- **Optimizers**: Adam optimizer for training
- **Evaluation**: Accuracy, Precision, Recall, F1-Score

## 🚀 Project Structure Overview

### Phase Breakdown
- **Phase 1**: Data Exploration (Tasks 1-5)
- **Phase 2**: Data Preprocessing (Tasks 6-9)  
- **Phase 3**: Model Preparation (Tasks 10-13)
- **Phase 4**: Binary Classification (Tasks 14-18)
- **Phase 5**: Multiclass Classification (Tasks 19-26)

## 💡 Learning Tips for Beginners

### Success Strategies
- **Take your time** - Don't rush through tasks
- **Read each explanation** carefully
- **Experiment** with the code
- **Ask questions** when confused
- **Celebrate small wins** - Each task completed is progress!

---

## 📊 Dataset Information

### Dataset Details
- **Source**: Kaggle - Hotel Booking Demand Dataset
- **Size**: 119,390 hotel bookings
- **Features**: 32 columns including dates, demographics, pricing
- **Target**: Predict if bookings will be canceled

## 🎯 Business Value

### Real-World Impact
This model helps hotels:
- 🎯 **Optimize revenue** by predicting cancellations
- 👥 **Allocate resources** (staff, rooms) better
- 📈 **Target marketing** to high-risk customers


## 📥 Dataset Download Function

### 🎯 What This Cell Does
This cell creates a reusable function to download Kaggle datasets and organize them in your project.

### 🧠 Learning Concepts
- **Functions**: Reusable code blocks
- **File Management**: Creating folders and copying files
- **Error Handling**: Using `os.makedirs(exist_ok=True)`

### 💡 Why This Approach?
- **Organized**: All datasets in one `data/` folder
- **Reusable**: Works for any Kaggle dataset
- **Portable**: Dataset travels with your project
- **Clean**: No messy cache paths in your code

### 🔧 How It Works
1. **Creates** a `data/` folder if it doesn't exist
2. **Downloads** dataset from Kaggle
3. **Copies** files to your project folder
4. **Returns** the local path for easy access

### 🚀 Run This Cell
Execute this cell to set up your dataset download function. You'll see output showing the download progress and file copying.

In [1]:
import kagglehub
import os
import shutil

def download_kaggle_dataset(dataset_name, project_root=None):
    """
    Download a Kaggle dataset and copy it to the project's data folder.
    
    Args:
        dataset_name (str): Kaggle dataset name in format 'username/dataset-name'
        project_root (str): Path to project root. If None, uses current directory.
    
    Returns:
        str: Path to the local data folder
    """
    if project_root is None:
        project_root = os.getcwd()
    
    # Create data folder if it doesn't exist
    data_folder = os.path.join(project_root, 'data')
    os.makedirs(data_folder, exist_ok=True)
    
    # Download dataset from Kaggle
    print(f"Downloading dataset: {dataset_name}")
    kaggle_path = kagglehub.dataset_download(dataset_name)
    print(f"Kaggle download path: {kaggle_path}")
    
    # Copy files to local data folder
    for file in os.listdir(kaggle_path):
        src = os.path.join(kaggle_path, file)
        dst = os.path.join(data_folder, file)
        if os.path.isfile(src):
            shutil.copy2(src, dst)
            print(f"Copied: {file}")
    
    print(f"Dataset files copied to: {data_folder}")
    return data_folder

# Download the hotel booking dataset
data_path = download_kaggle_dataset("jessemostipak/hotel-booking-demand")

  from .autonotebook import tqdm as notebook_tqdm


Downloading dataset: jessemostipak/hotel-booking-demand
Kaggle download path: /Users/franciscoteixeirabarbosa/.cache/kagglehub/datasets/jessemostipak/hotel-booking-demand/versions/1
Copied: hotel_bookings.csv
Dataset files copied to: /Users/franciscoteixeirabarbosa/Dropbox/Random_scripts/predict_hotel_cancellations/data


## 📊 Load and Explore Dataset

### 🎯 What This Cell Does
This cell loads the hotel booking dataset and gives you a first look at the data.

### 🧠 Learning Concepts
- **DataFrames**: Pandas' main data structure (like Excel sheets)
- **Data Shape**: Rows × Columns (119,390 × 32)
- **Data Types**: Different types of data (numbers, text, dates)

### 📋 Task 1: Import and Inspect
- **Goal**: Load the CSV file into a pandas DataFrame
- **Method**: Use `pd.read_csv()` to read the file
- **Output**: See the first 5 rows with `.head()`

### 🔍 What You'll See
- **Shape**: (119,390, 32) = 119,390 bookings with 32 features
- **Columns**: 32 different pieces of information per booking
- **Sample Data**: First 5 rows to understand the data structure

### 💡 Key Columns to Notice
- `is_canceled`: Our main target (0 = No, 1 = Yes)
- `reservation_status`: Our multiclass target
- `hotel`: Type of hotel
- `lead_time`: Days between booking and arrival
- `adr`: Average Daily Rate (price per night)

### 🚀 Run This Cell
Execute this cell to load your dataset and see what you're working with!


In [2]:
import pandas as pd

# Load the dataset from local data folder
df = pd.read_csv("data/hotel_bookings.csv")

print(f"Dataset shape: {df.shape}")
print(f"Columns: {list(df.columns)}")

# Method 1: Set pandas display options to show all columns
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)

print("\n=== Full DataFrame Preview (all columns visible) ===")
df.head()


Dataset shape: (119390, 32)
Columns: ['hotel', 'is_canceled', 'lead_time', 'arrival_date_year', 'arrival_date_month', 'arrival_date_week_number', 'arrival_date_day_of_month', 'stays_in_weekend_nights', 'stays_in_week_nights', 'adults', 'children', 'babies', 'meal', 'country', 'market_segment', 'distribution_channel', 'is_repeated_guest', 'previous_cancellations', 'previous_bookings_not_canceled', 'reserved_room_type', 'assigned_room_type', 'booking_changes', 'deposit_type', 'agent', 'company', 'days_in_waiting_list', 'customer_type', 'adr', 'required_car_parking_spaces', 'total_of_special_requests', 'reservation_status', 'reservation_status_date']

=== Full DataFrame Preview (all columns visible) ===


Unnamed: 0,hotel,is_canceled,lead_time,arrival_date_year,arrival_date_month,arrival_date_week_number,arrival_date_day_of_month,stays_in_weekend_nights,stays_in_week_nights,adults,children,babies,meal,country,market_segment,distribution_channel,is_repeated_guest,previous_cancellations,previous_bookings_not_canceled,reserved_room_type,assigned_room_type,booking_changes,deposit_type,agent,company,days_in_waiting_list,customer_type,adr,required_car_parking_spaces,total_of_special_requests,reservation_status,reservation_status_date
0,Resort Hotel,0,342,2015,July,27,1,0,0,2,0.0,0,BB,PRT,Direct,Direct,0,0,0,C,C,3,No Deposit,,,0,Transient,0.0,0,0,Check-Out,2015-07-01
1,Resort Hotel,0,737,2015,July,27,1,0,0,2,0.0,0,BB,PRT,Direct,Direct,0,0,0,C,C,4,No Deposit,,,0,Transient,0.0,0,0,Check-Out,2015-07-01
2,Resort Hotel,0,7,2015,July,27,1,0,1,1,0.0,0,BB,GBR,Direct,Direct,0,0,0,A,C,0,No Deposit,,,0,Transient,75.0,0,0,Check-Out,2015-07-02
3,Resort Hotel,0,13,2015,July,27,1,0,1,1,0.0,0,BB,GBR,Corporate,Corporate,0,0,0,A,A,0,No Deposit,304.0,,0,Transient,75.0,0,0,Check-Out,2015-07-02
4,Resort Hotel,0,14,2015,July,27,1,0,2,2,0.0,0,BB,GBR,Online TA,TA/TO,0,0,0,A,A,0,No Deposit,240.0,,0,Transient,98.0,0,1,Check-Out,2015-07-03


## 🔍 Alternative Ways to Explore Data

### 🎯 What This Cell Shows
Different methods to explore your data when `df.head()` hides columns with `...`

### 🧠 Learning Concepts
- **Pandas Display Options**: Controlling how data is shown
- **Column Selection**: Viewing specific columns
- **Data Sampling**: Looking at different parts of the data

### 💡 Why This Matters
- **Complete View**: See all columns, not just a subset
- **Better Understanding**: Know exactly what data you're working with
- **Debugging**: Spot issues in specific columns


In [3]:
# Method 2: View specific columns you're interested in
print("=== Key Columns Preview ===")
key_columns = ['hotel', 'is_canceled', 'lead_time', 'arrival_date_month', 'adults', 'children', 'adr', 'reservation_status']
df[key_columns].head()

print("\n=== All Columns (transposed view) ===")
# Method 3: Transpose to see all columns as rows
df.head(3).T  # T means transpose (rows become columns)

print("\n=== Column Information ===")
# Method 4: Get detailed info about each column
print(f"Total columns: {len(df.columns)}")
print(f"Column names: {df.columns.tolist()}")


=== Key Columns Preview ===

=== All Columns (transposed view) ===

=== Column Information ===
Total columns: 32
Column names: ['hotel', 'is_canceled', 'lead_time', 'arrival_date_year', 'arrival_date_month', 'arrival_date_week_number', 'arrival_date_day_of_month', 'stays_in_weekend_nights', 'stays_in_week_nights', 'adults', 'children', 'babies', 'meal', 'country', 'market_segment', 'distribution_channel', 'is_repeated_guest', 'previous_cancellations', 'previous_bookings_not_canceled', 'reserved_room_type', 'assigned_room_type', 'booking_changes', 'deposit_type', 'agent', 'company', 'days_in_waiting_list', 'customer_type', 'adr', 'required_car_parking_spaces', 'total_of_special_requests', 'reservation_status', 'reservation_status_date']


## 🔍 Data Inspection

### 🎯 What This Cell Does
This cell helps you understand the data types and check for missing values.

### 🧠 Learning Concepts
- **Data Types**: int64, float64, object (text)
- **Missing Values**: NaN (Not a Number) - empty cells
- **Memory Usage**: How much RAM the data uses

### 📋 Task 2: Inspect Data Types and Missing Values
- **Goal**: Understand what type of data each column contains
- **Method**: Use `.info()` method
- **Look for**: Missing values (non-null count < total rows)

### 🔍 What to Look For
- **Missing Values**: Columns with fewer non-null values
- **Data Types**: 
  - `int64`: Whole numbers (1, 2, 3...)
  - `float64`: Decimal numbers (1.5, 2.7...)
  - `object`: Text/strings ("Resort Hotel", "City Hotel")

### 💡 Why This Matters
- **Missing Values**: Can cause problems in machine learning
- **Data Types**: Affect how we process the data
- **Memory**: Large datasets need efficient storage

### 🚀 Run This Cell
Execute this cell to see the data structure and identify any issues!


In [None]:
# Task 2: Inspect data types and missing values
df.info()



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119390 entries, 0 to 119389
Data columns (total 32 columns):
 #   Column                          Non-Null Count   Dtype  
---  ------                          --------------   -----  
 0   hotel                           119390 non-null  object 
 1   is_canceled                     119390 non-null  int64  
 2   lead_time                       119390 non-null  int64  
 3   arrival_date_year               119390 non-null  int64  
 4   arrival_date_month              119390 non-null  object 
 5   arrival_date_week_number        119390 non-null  int64  
 6   arrival_date_day_of_month       119390 non-null  int64  
 7   stays_in_weekend_nights         119390 non-null  int64  
 8   stays_in_week_nights            119390 non-null  int64  
 9   adults                          119390 non-null  int64  
 10  children                        119386 non-null  float64
 11  babies                          119390 non-null  int64  
 12  meal            

In [None]:
# Example: Explore relationship between hotel type and cancellations
print("Hotel type vs cancellation rate:")
df.groupby('hotel')['is_canceled'].mean()


hotel


In [None]:
# Example: How does hotel type relate to cancellations?
df.groupby('hotel')['is_canceled'].mean()

arrival_date_month
April        0.407972
August       0.377531
December     0.349705
February     0.334160
January      0.304773
July         0.374536
June         0.414572
March        0.321523
May          0.396658
November     0.312334
October      0.380466
September    0.391702
Name: is_canceled, dtype: float64 

meal
BB           0.373849
FB           0.598997
HB           0.344603
SC           0.372394
Undefined    0.244654
Name: is_canceled, dtype: float64 

deposit_type
No Deposit    0.283770
Non Refund    0.993624
Refundable    0.222222
Name: is_canceled, dtype: float64 

market_segment
Aviation         0.219409
Complementary    0.130552
Corporate        0.187347
Direct           0.153419
Groups           0.610620
Offline TA/TO    0.343160
Online TA        0.367211
Undefined        1.000000
Name: is_canceled, dtype: float64 

country
ABW    0.000000
AGO    0.566298
AIA    0.000000
ALB    0.166667
AND    0.714286
         ...   
VGB    1.000000
VNM    0.250000
ZAF    0.387500
Z

## 📊 Analyze Cancellation Rates

### 🎯 What This Cell Does
This cell analyzes how many bookings are canceled vs. not canceled.

### 🧠 Learning Concepts
- **Target Variable**: What we want to predict (`is_canceled`)
- **Class Distribution**: How balanced/unbalanced our data is
- **Percentage Calculations**: Understanding proportions

### 📋 Task 3: Explore Cancellation Column
- **Goal**: Count cancellations and non-cancellations
- **Method**: Use `.value_counts()` on `is_canceled` column
- **Output**: Numbers and percentages

### 🔍 What You'll See
- **0**: Not canceled (guests showed up)
- **1**: Canceled (guests canceled their booking)
- **Counts**: How many of each
- **Percentages**: What % of total bookings

### 💡 Why This Matters
- **Class Imbalance**: If one class is much more common
- **Business Impact**: High cancellation rates = lost revenue
- **Model Performance**: Unbalanced data affects accuracy

### 🚀 Run This Cell
Execute this cell to see the cancellation distribution!


In [None]:
# Task 3: Count cancellations and non-cancellations
df['is_canceled'].value_counts()

hotel: range = 0.13963608350237788
is_canceled: range = 1.0
lead_time: range = 1.0
arrival_date_year: range = 0.02834566882505818
arrival_date_month: range = 0.10979856694997209
arrival_date_week_number: range = 0.24287037109259652
arrival_date_day_of_month: range = 0.09482466771841208
stays_in_weekend_nights: range = 0.7368421052631579
stays_in_week_nights: range = 1.0
adults: range = 0.7419354838709677
children: range = 0.7763157894736842
babies: range = 0.3718737602660522
meal: range = 0.3543439436915643
country: range = 1.0
market_segment: range = 0.8694481830417228
distribution_channel: range = 0.6254011608057358
is_repeated_guest: range = 0.23296894948176466
previous_cancellations: range = 0.8947368421052632
previous_bookings_not_canceled: range = 0.5
reserved_room_type: range = 0.7071155317521041
assigned_room_type: range = 0.9862258953168044
booking_changes: range = 0.5
deposit_type: range = 0.7714022379135151
agent: range = 1.0
company: range = 1.0
days_in_waiting_list: range 

## 🚨 Check Reservation Status (IMPORTANT!)

### 🎯 What This Cell Does
This cell examines the `reservation_status` column, which we'll use for multiclass classification.

### ⚠️ CRITICAL WARNING
**DO NOT use `reservation_status` in your binary model!** This would be data leakage - it already tells us the answer!

### 🧠 Learning Concepts
- **Data Leakage**: Using future information to predict the past
- **Multiclass vs Binary**: Different prediction tasks
- **Feature Selection**: Choosing the right inputs

### 📋 Task 4: Explore Reservation Status
- **Goal**: Understand the three categories
- **Method**: Use `.value_counts()` on `reservation_status`
- **Categories**:
  - **Check-Out**: Guest showed up and stayed
  - **Canceled**: Guest canceled before arrival
  - **No-Show**: Guest didn't show up (didn't cancel)

### 💡 Why This Matters
- **Binary Model**: Predicts `is_canceled` (0/1)
- **Multiclass Model**: Predicts `reservation_status` (3 categories)
- **Different Tasks**: Each model solves a different problem

### 🚀 Run This Cell
Execute this cell to see the three booking outcomes!


In [None]:
# Task 4: Check reservation status categories
df['reservation_status'].value_counts()


## 📅 Monthly Cancellation Patterns

### 🎯 What This Cell Does
This cell analyzes cancellation rates by month to find seasonal patterns.

### 🧠 Learning Concepts
- **Grouping**: Combining data by categories
- **Aggregation**: Calculating statistics (mean, sum, count)
- **Sorting**: Ordering results from low to high

### 📋 Task 5: Analyze Cancellations by Month
- **Goal**: Find which months have highest/lowest cancellation rates
- **Method**: 
  1. Group by `arrival_date_month`
  2. Calculate mean of `is_canceled` (gives percentage)
  3. Sort from lowest to highest

### 🔍 What You'll See
- **Months**: January, February, March, etc.
- **Cancellation Rates**: Percentage for each month
- **Patterns**: Seasonal trends in cancellations

### 💡 Why This Matters
- **Business Insights**: When to expect more cancellations
- **Feature Engineering**: Month could be a useful predictor
- **Marketing**: Target high-risk periods

### 🚀 Run This Cell
Execute this cell to discover seasonal cancellation patterns!


In [None]:
# Task 5: Analyze cancellations by month
monthly_cancellations = df.groupby('arrival_date_month')['is_canceled'].mean()
monthly_cancellations.sort_values()


In [None]:
interesting_columns = []

for data_column in df.columns:
    if data_column != "is_canceled":
        cancellation_rate = df.groupby(data_column)["is_canceled"].mean()
        range_difference = cancellation_rate.max() - cancellation_rate.min()
        print(f"{data_column}: range = {range_difference}")
        if range_difference > 0.15:
            interesting_columns.append([data_column, range_difference])

print(interesting_columns)

In [59]:
# Find columns with very high ranges (might be biased)
high_range_columns = []
for item in interesting_columns:
    if item[1] > 0.8:  # Very high range
        high_range_columns.append(item[0])
print("High range columns:", high_range_columns)

# Sort by range difference (highest first)
sorted_columns = sorted(interesting_columns, key=lambda x: x[1], reverse=True)
print("Sorted by range:")
for item in sorted_columns:
    print(f"{item[0]}: {item[1]:.3f}")

# Keep only columns with meaningful ranges
useful_columns = []
for item in interesting_columns:
    if item[1] < 1.0:  # Exclude range = 1.0
        useful_columns.append(item[0])
print("Useful columns:", useful_columns)

# Maybe also exclude very high ranges (0.8+)
reasonable_columns = []
for item in interesting_columns:
    if 0.1 < item[1] < 0.8:  # Reasonable range
        reasonable_columns.append(item[0])
print("Reasonable columns:", reasonable_columns)

# Check which columns have range = 1.0
problematic_columns = []
for item in interesting_columns:
    if item[1] == 1.0:
        problematic_columns.append(item[0])
print("Problematic columns (range = 1.0):", problematic_columns)


High range columns: ['is_canceled', 'lead_time', 'stays_in_week_nights', 'country', 'market_segment', 'previous_cancellations', 'assigned_room_type', 'agent', 'company', 'days_in_waiting_list', 'adr', 'reservation_status', 'reservation_status_date']
Sorted by range:
is_canceled: 1.000
lead_time: 1.000
stays_in_week_nights: 1.000
country: 1.000
agent: 1.000
company: 1.000
days_in_waiting_list: 1.000
adr: 1.000
reservation_status: 1.000
reservation_status_date: 1.000
assigned_room_type: 0.986
previous_cancellations: 0.895
market_segment: 0.869
children: 0.776
deposit_type: 0.771
adults: 0.742
stays_in_weekend_nights: 0.737
reserved_room_type: 0.707
distribution_channel: 0.625
previous_bookings_not_canceled: 0.500
booking_changes: 0.500
total_of_special_requests: 0.427
required_car_parking_spaces: 0.395
babies: 0.372
meal: 0.354
customer_type: 0.305
arrival_date_week_number: 0.243
is_repeated_guest: 0.233
Useful columns: ['arrival_date_week_number', 'stays_in_weekend_nights', 'adults', 'c

## 📊 Analyze Cancellation Rates

### 🎯 What This Cell Does
This cell analyzes how many bookings are canceled vs. not canceled.

### 🧠 Learning Concepts
- **Target Variable**: What we want to predict (`is_canceled`)
- **Class Distribution**: How balanced/unbalanced our data is
- **Percentage Calculations**: Understanding proportions

### 📋 Task 3: Explore Cancellation Column
- **Goal**: Count cancellations and non-cancellations
- **Method**: Use `.value_counts()` on `is_canceled` column
- **Output**: Numbers and percentages

### 🔍 What You'll See
- **0**: Not canceled (guests showed up)
- **1**: Canceled (guests canceled their booking)
- **Counts**: How many of each
- **Percentages**: What % of total bookings

### 💡 Why This Matters
- **Class Imbalance**: If one class is much more common
- **Business Impact**: High cancellation rates = lost revenue
- **Model Performance**: Unbalanced data affects accuracy

### 🚀 Run This Cell
Execute this cell to see the cancellation distribution!


In [None]:
# Task 3: Count cancellations and non-cancellations
df['is_canceled'].value_counts()


## 🚨 Check Reservation Status (IMPORTANT!)

### 🎯 What This Cell Does
This cell examines the `reservation_status` column, which we'll use for multiclass classification.

### ⚠️ CRITICAL WARNING
**DO NOT use `reservation_status` in your binary model!** This would be data leakage - it already tells us the answer!

### 🧠 Learning Concepts
- **Data Leakage**: Using future information to predict the past
- **Multiclass vs Binary**: Different prediction tasks
- **Feature Selection**: Choosing the right inputs

### 📋 Task 4: Explore Reservation Status
- **Goal**: Understand the three categories
- **Method**: Use `.value_counts()` on `reservation_status`
- **Categories**:
  - **Check-Out**: Guest showed up and stayed
  - **Canceled**: Guest canceled before arrival
  - **No-Show**: Guest didn't show up (didn't cancel)

### 💡 Why This Matters
- **Binary Model**: Predicts `is_canceled` (0/1)
- **Multiclass Model**: Predicts `reservation_status` (3 categories)
- **Different Tasks**: Each model solves a different problem

### 🚀 Run This Cell
Execute this cell to see the three booking outcomes!


In [None]:
# Task 4: Check reservation status categories
df['reservation_status'].value_counts()


## 📅 Monthly Cancellation Patterns

### 🎯 What This Cell Does
This cell analyzes cancellation rates by month to find seasonal patterns.

### 🧠 Learning Concepts
- **Grouping**: Combining data by categories
- **Aggregation**: Calculating statistics (mean, sum, count)
- **Sorting**: Ordering results from low to high

### 📋 Task 5: Analyze Cancellations by Month
- **Goal**: Find which months have highest/lowest cancellation rates
- **Method**: 
  1. Group by `arrival_date_month`
  2. Calculate mean of `is_canceled` (gives percentage)
  3. Sort from lowest to highest

### 🔍 What You'll See
- **Months**: January, February, March, etc.
- **Cancellation Rates**: Percentage for each month
- **Patterns**: Seasonal trends in cancellations

### 💡 Why This Matters
- **Business Insights**: When to expect more cancellations
- **Feature Engineering**: Month could be a useful predictor
- **Marketing**: Target high-risk periods

### 🚀 Run This Cell
Execute this cell to discover seasonal cancellation patterns!


In [None]:
# Task 5: Analyze cancellations by month
monthly_cancellations = df.groupby('arrival_date_month')['is_canceled'].mean()
monthly_cancellations.sort_values()


## 📊 Task 6: Feature Analysis & Selection

### 🎯 What This Cell Does
This cell identifies categorical columns AND analyzes which features are strong predictors of cancellation.

### 🧠 Learning Concepts
- **Object Data Type**: Columns containing text/strings
- **Feature Selection**: Finding predictive features
- **Cancellation Rate Analysis**: Measuring predictive power
- **Range Analysis**: Understanding feature importance

### 📋 Your Task - Two Parts:
**Part A**: Find all columns with 'object' data type
**Part B**: Analyze cancellation rates across all features to identify strong predictors

### 🔍 What You'll Discover:
- **Categorical Columns**: Text-based features that need encoding
- **Feature Rankings**: Which columns predict cancellations best
- **Data Leakage Warning**: Features with perfect prediction (range = 1.0)
- **Recommended Features**: Clean predictors for your model

### 💡 Why This Matters
- **Smart Feature Selection**: Use only meaningful predictors
- **Avoid Data Leakage**: Remove features that "cheat"
- **Model Performance**: Better features = better predictions

### 🚀 Run This Cell
Execute this cell to discover your best predictive features!


In [None]:
# Task 6: Identify and explore categorical columns
categorical_columns = []

# Find all columns with 'object' data type
for column in df.columns:
    if df[column].dtype == 'object':
        categorical_columns.append(column)

print("Categorical columns found:")
print(categorical_columns)
print(f"\nTotal categorical columns: {len(categorical_columns)}")

# Preview unique values for each categorical column
print("\n" + "="*50)
print("CATEGORICAL COLUMN ANALYSIS")
print("="*50)

for col in categorical_columns:
    unique_values = df[col].unique()
    print(f"\n🏷️  {col}")
    print(f"   Unique values: {len(unique_values)}")
    print(f"   Categories: {unique_values[:10]}")  # Show first 10
    if len(unique_values) > 10:
        print(f"   ... and {len(unique_values) - 10} more")

print("\n" + "="*60)
print("FEATURE SELECTION: CANCELLATION RATE ANALYSIS")
print("="*60)

# Analyze cancellation rates across all columns to identify strong predictors
interesting_columns = []

for data_column in df.columns:
    if data_column != "is_canceled":  # Don't analyze target against itself
        cancellation_rate = df.groupby(data_column)["is_canceled"].mean()
        range_difference = cancellation_rate.max() - cancellation_rate.min()
        print(f"{data_column}: range = {range_difference}")
        if range_difference > 0.15:
            interesting_columns.append([data_column, range_difference])

print(f"\n🎯 POTENTIALLY USEFUL FEATURES ({len(interesting_columns)} found):")
print("-" * 50)

# Sort by predictive power (range difference)
interesting_columns.sort(key=lambda x: x[1], reverse=True)

for col_name, range_val in interesting_columns:
    status = "🚨 Very High" if range_val >= 1.0 else "⚡ High" if range_val >= 0.8 else "✅ Good"
    print(f"{status}: {col_name} (range: {range_val:.3f})")

# Filter out problematic columns (range = 1.0 indicates data leakage or extreme outliers)
useful_columns = [col[0] for col in interesting_columns if col[1] < 1.0]
print(f"\n✅ RECOMMENDED FEATURES FOR MODEL ({len(useful_columns)}):")
print(useful_columns)


# 🎉 Phase 1 Complete: Data Exploration

## ✅ What You've Learned So Far
- **Data Loading**: How to read CSV files with pandas
- **Data Inspection**: Understanding data types and missing values
- **Target Analysis**: Exploring what you want to predict
- **Pattern Discovery**: Finding seasonal trends in cancellations

## 📊 Key Insights from Your Analysis
- **Dataset Size**: 119,390 hotel bookings
- **Cancellation Rate**: ~27.8% (varies by month)
- **Three Outcomes**: Check-out, Canceled, No-show
- **Seasonal Patterns**: Some months have higher cancellation rates

## 🚀 Next Phase: Data Preprocessing
Now we'll prepare the data for machine learning:

### Data Preparation Steps
- **Clean the data**: Remove irrelevant columns
- **Encode categories**: Convert text to numbers
- **Handle missing values**: Fill or remove empty cells
- **Create features**: Prepare inputs for neural networks

## 💡 Learning Tip
Take a moment to understand what you've discovered. The patterns you found will help guide your model building!

---

# 🧹 Phase 2: Data Preprocessing

## 🎯 What This Phase Does
Transform raw data into a format that neural networks can understand.

## 🧠 Learning Concepts
- **Feature Engineering**: Creating useful inputs
- **Encoding**: Converting text to numbers
- **Data Cleaning**: Removing irrelevant information
- **One-Hot Encoding**: Creating binary columns for categories

## 📋 Tasks 6-9: Data Cleaning and Preparation
- **Task 6**: Preview categorical columns
- **Task 7**: Drop irrelevant columns
- **Task 8**: Encode meal types
- **Task 9**: Apply one-hot encoding

## 💡 Why This Matters
Neural networks need:
- **Numbers only**: No text allowed
- **No missing values**: Complete data required
- **Relevant features**: Only useful information
