# 🏨 Hotel Cancellation Prediction Project

## 🎯 Learning Objectives
Welcome to your neural network journey! This project will teach you:

### Core Concepts You'll Master
- **Binary Classification**: Predicting Yes/No (Will cancel? Yes/No)
- **Multiclass Classification**: Predicting categories (Check-out/Canceled/No-show)
- **PyTorch Fundamentals**: Building and training neural networks
- **Real-world ML**: Using actual hotel booking data

## 📚 Neural Network Fundamentals

### Architecture Components
- **Layers**: Input → Hidden → Output layers
- **Activation Functions**: ReLU, Sigmoid, Softmax
- **Loss Functions**: Binary Cross-Entropy, Cross-Entropy
- **Optimizers**: Adam optimizer for training
- **Evaluation**: Accuracy, Precision, Recall, F1-Score

## 🚀 Project Structure Overview

### Phase Breakdown
- **Phase 1**: Data Exploration (Tasks 1-6) ✅ **COMPLETED**
- **Phase 2**: Data Preprocessing (Tasks 7-9) 🚧 **IN PROGRESS**  
- **Phase 3**: Model Preparation (Tasks 10-13)
- **Phase 4**: Binary Classification (Tasks 14-18)
- **Phase 5**: Multiclass Classification (Tasks 19-26)

---

## 📊 Dataset Information

### Dataset Details
- **Source**: Kaggle - Hotel Booking Demand Dataset
- **Size**: 119,390 hotel bookings
- **Features**: 32 columns including dates, demographics, pricing
- **Target**: Predict if bookings will be canceled


## 📥 Dataset Download Function

### 🎯 What This Cell Does
This cell creates a reusable function to download Kaggle datasets and organize them in your project.

### 🧠 Learning Concepts
- **Functions**: Reusable code blocks
- **File Management**: Creating folders and copying files
- **Error Handling**: Using `os.makedirs(exist_ok=True)`

### 🚀 Run This Cell
Execute this cell to set up your dataset download function.


In [136]:
import kagglehub
import os
import shutil

def download_kaggle_dataset(dataset_name, project_root=None):
    """
    Download a Kaggle dataset and copy it to the project's data folder.
    
    Args:
        dataset_name (str): Kaggle dataset name in format 'username/dataset-name'
        project_root (str): Path to project root. If None, uses current directory.
    
    Returns:
        str: Path to the local data folder
    """
    if project_root is None:
        project_root = os.getcwd()
    
    # Create data folder if it doesn't exist
    data_folder = os.path.join(project_root, 'data')
    os.makedirs(data_folder, exist_ok=True)
    
    # Download dataset from Kaggle
    print(f"Downloading dataset: {dataset_name}")
    kaggle_path = kagglehub.dataset_download(dataset_name)
    print(f"Kaggle download path: {kaggle_path}")
    
    # Copy files to local data folder
    for file in os.listdir(kaggle_path):
        src = os.path.join(kaggle_path, file)
        dst = os.path.join(data_folder, file)
        if os.path.isfile(src):
            shutil.copy2(src, dst)
            print(f"Copied: {file}")
    
    print(f"Dataset files copied to: {data_folder}")
    return data_folder

# Download the hotel booking dataset
data_path = download_kaggle_dataset("jessemostipak/hotel-booking-demand")


Downloading dataset: jessemostipak/hotel-booking-demand
Kaggle download path: /Users/franciscoteixeirabarbosa/.cache/kagglehub/datasets/jessemostipak/hotel-booking-demand/versions/1
Copied: hotel_bookings.csv
Dataset files copied to: /Users/franciscoteixeirabarbosa/Dropbox/Random_scripts/predict_hotel_cancellations/data


## 📊 Load and Explore Dataset

### 📋 Task 1: Import and Inspect
- **Goal**: Load the CSV file into a pandas DataFrame
- **Method**: Use `pd.read_csv()` to read the file
- **Output**: See the first 5 rows with `.head()`

### 🚀 Run This Cell
Execute this cell to load your dataset and see what you're working with!


In [137]:
import pandas as pd

# Load the dataset from local data folder
df = pd.read_csv("data/hotel_bookings.csv")

print(f"Dataset shape: {df.shape}")
print(f"Columns: {list(df.columns)}")

# Set pandas display options to show all columns
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)

print("\n=== Full DataFrame Preview (all columns visible) ===")
df.head()


Dataset shape: (119390, 32)
Columns: ['hotel', 'is_canceled', 'lead_time', 'arrival_date_year', 'arrival_date_month', 'arrival_date_week_number', 'arrival_date_day_of_month', 'stays_in_weekend_nights', 'stays_in_week_nights', 'adults', 'children', 'babies', 'meal', 'country', 'market_segment', 'distribution_channel', 'is_repeated_guest', 'previous_cancellations', 'previous_bookings_not_canceled', 'reserved_room_type', 'assigned_room_type', 'booking_changes', 'deposit_type', 'agent', 'company', 'days_in_waiting_list', 'customer_type', 'adr', 'required_car_parking_spaces', 'total_of_special_requests', 'reservation_status', 'reservation_status_date']

=== Full DataFrame Preview (all columns visible) ===


Unnamed: 0,hotel,is_canceled,lead_time,arrival_date_year,arrival_date_month,arrival_date_week_number,arrival_date_day_of_month,stays_in_weekend_nights,stays_in_week_nights,adults,children,babies,meal,country,market_segment,distribution_channel,is_repeated_guest,previous_cancellations,previous_bookings_not_canceled,reserved_room_type,assigned_room_type,booking_changes,deposit_type,agent,company,days_in_waiting_list,customer_type,adr,required_car_parking_spaces,total_of_special_requests,reservation_status,reservation_status_date
0,Resort Hotel,0,342,2015,July,27,1,0,0,2,0.0,0,BB,PRT,Direct,Direct,0,0,0,C,C,3,No Deposit,,,0,Transient,0.0,0,0,Check-Out,2015-07-01
1,Resort Hotel,0,737,2015,July,27,1,0,0,2,0.0,0,BB,PRT,Direct,Direct,0,0,0,C,C,4,No Deposit,,,0,Transient,0.0,0,0,Check-Out,2015-07-01
2,Resort Hotel,0,7,2015,July,27,1,0,1,1,0.0,0,BB,GBR,Direct,Direct,0,0,0,A,C,0,No Deposit,,,0,Transient,75.0,0,0,Check-Out,2015-07-02
3,Resort Hotel,0,13,2015,July,27,1,0,1,1,0.0,0,BB,GBR,Corporate,Corporate,0,0,0,A,A,0,No Deposit,304.0,,0,Transient,75.0,0,0,Check-Out,2015-07-02
4,Resort Hotel,0,14,2015,July,27,1,0,2,2,0.0,0,BB,GBR,Online TA,TA/TO,0,0,0,A,A,0,No Deposit,240.0,,0,Transient,98.0,0,1,Check-Out,2015-07-03


# 🎉 Phase 1 Complete: Data Exploration Summary

## ✅ What You've Accomplished (Tasks 1-6)

### Task 1-2: Data Loading & Inspection ✅
- Loaded 119,390 hotel bookings with 32 features
- Identified data types and missing values
- Found key columns: `is_canceled`, `reservation_status`, `hotel`, etc.

### Task 3-4: Target Analysis ✅  
- **Cancellation Rate**: ~37% overall (44,224 canceled / 75,166 not canceled)
- **Hotel Type Impact**: City hotels (41.7%) vs Resort hotels (27.8%)
- **Three Outcomes**: Check-Out, Canceled, No-Show

### Task 5: Seasonal Patterns ✅
- **Lowest**: January (30.5%), November (31.2%), March (32.2%)
- **Highest**: June (41.5%), April (40.8%), May (39.7%)

### Task 6: Feature Selection ✅
- **18 Recommended Features** for model training
- **9 Problematic Features** removed (data leakage - range = 1.0)
- **12 Categorical Columns** identified for encoding

---

# 🧹 Phase 2: Data Preprocessing

## 🎯 What This Phase Does
Transform raw data into a format that neural networks can understand.

## 🧠 Learning Concepts
- **Feature Engineering**: Creating useful inputs
- **Encoding**: Converting text to numbers
- **Data Cleaning**: Removing irrelevant information
- **One-Hot Encoding**: Creating binary columns for categories

## 📋 Tasks 7-9: Data Cleaning and Preparation
- **Task 7**: Drop irrelevant columns ⬅️ **NEXT**
- **Task 8**: Encode meal types
- **Task 9**: Apply one-hot encoding

## 💡 Why This Matters
Neural networks need:
- **Numbers only**: No text allowed
- **No missing values**: Complete data required
- **Relevant features**: Only useful information


## 🗑️ Task 7: Drop Irrelevant Columns

### 🎯 What This Cell Does
Remove columns that would cause data leakage or are not useful for prediction.

### 🧠 Learning Concepts
- **Data Leakage**: Features that "cheat" by using future information
- **Feature Selection**: Keeping only relevant predictors
- **Too Many Categories**: Columns with excessive unique values

### 📋 Columns to Remove:
**Data Leakage (range = 1.0)**:
- `reservation_status` - This IS the answer for multiclass!
- `reservation_status_date` - Contains outcome information

**Too Many Categories**:
- `country` - 178 unique values (too sparse)
- `agent` - Thousands of unique values
- `company` - Thousands of unique values

**Perfect Predictors (suspicious)**:
- `lead_time`, `stays_in_week_nights`, `days_in_waiting_list`, `adr`

### 🚀 Run This Cell
Execute this cell to clean your dataset for model training!


In [138]:
# Task 7: Drop irrelevant columns for model training

# TODO: Create a list called 'columns_to_drop' with these 9 columns:
# Data leakage columns: 'reservation_status', 'reservation_status_date'
# Too many categories: 'country', 'agent', 'company'  
# Perfect predictors: 'lead_time', 'stays_in_week_nights', 'days_in_waiting_list', 'adr'


columns_to_drop = ['reservation_status', 'reservation_status_date', 
'country', 'agent', 'company', 'lead_time', 'stays_in_week_nights', 'days_in_waiting_list', 'adr']

# TODO: Print the original dataset shape using df.shape

print("=== DF.SHAPE BEFORE ===")
print(df.shape)

# TODO: Print the columns_to_drop list and its length
print("List of columns to drop:",columns_to_drop)
print("Length of clumns to drop", len(columns_to_drop))

# TODO: Create df_clean by dropping the columns from df using .drop()
df_clean = df.drop(columns=columns_to_drop)

# TODO: Print the cleaned dataset shape
print("=== DF.SHAPE AFTER ===")
print(df_clean.shape)
# TODO: Print the remaining column names
print(df_clean.columns)

# TODO: Verify 'is_canceled' is still present in df_clean.columns

if df_clean["is_canceled"].any() == True:
    print("is_canceled column is still present")
else:
    print("Not present")

# TODO: Find remaining categorical columns (dtype == 'object') and print them
categorical_columns = df_clean.select_dtypes(include=['object']).columns
print(categorical_columns)


=== DF.SHAPE BEFORE ===
(119390, 32)
List of columns to drop: ['reservation_status', 'reservation_status_date', 'country', 'agent', 'company', 'lead_time', 'stays_in_week_nights', 'days_in_waiting_list', 'adr']
Length of clumns to drop 9
=== DF.SHAPE AFTER ===
(119390, 23)
Index(['hotel', 'is_canceled', 'arrival_date_year', 'arrival_date_month',
       'arrival_date_week_number', 'arrival_date_day_of_month',
       'stays_in_weekend_nights', 'adults', 'children', 'babies', 'meal',
       'market_segment', 'distribution_channel', 'is_repeated_guest',
       'previous_cancellations', 'previous_bookings_not_canceled',
       'reserved_room_type', 'assigned_room_type', 'booking_changes',
       'deposit_type', 'customer_type', 'required_car_parking_spaces',
       'total_of_special_requests'],
      dtype='object')
is_canceled column is still present
Index(['hotel', 'arrival_date_month', 'meal', 'market_segment',
       'distribution_channel', 'reserved_room_type', 'assigned_room_type',
  

## 🍽️ Task 8: Label Encode Meal Column

### 🎯 What This Cell Does
Convert the `meal` column to numbers while preserving the logical order.

### 🧠 Learning Concepts
- **Ordinal Encoding**: When categories have a natural order
- **Label Encoding**: Converting text to numbers
- **Preserving Meaning**: Keeping logical relationships

### 📋 Meal Types (in order):
- **Undefined/SC** → 0 (No meal or self-catering)
- **BB** → 1 (Bed & Breakfast)
- **HB** → 2 (Half Board - breakfast + dinner)
- **FB** → 3 (Full Board - all meals)

### 💡 Why This Order?
More meals = higher service level = potentially different cancellation behavior

### 🚀 Run This Cell
Execute this cell to encode the meal column!


In [139]:
# Task 8: Label encode meal column with meaningful order

# TODO: Print the current meal values using df_clean['meal'].value_counts()
df_clean["meal"].value_counts()

# TODO: Create a meal_mapping dictionary with this order:
# 'Undefined': 0, 'SC': 0, 'BB': 1, 'HB': 2, 'FB': 3
meal_mapping = {
    "SC": 0,
    "Undefined": 0,
    "BB": 1,
    "HB": 2,
    "FB": 3
}

# TODO: Print the meal_mapping dictionary
print(meal_mapping)

# TODO: Create a new column 'meal_encoded' by applying the mapping to 'meal' column
# Hint: use df_clean['meal'].map(meal_mapping)
df_clean["meal_encoded"] = df_clean["meal"].map(meal_mapping)

# TODO: Verify the encoding by showing unique meal/meal_encoded pairs
# Hint: use df_clean[['meal', 'meal_encoded']].drop_duplicates().sort_values('meal_encoded')
print(df_clean[['meal', 'meal_encoded']].drop_duplicates().sort_values('meal_encoded'))
# TODO: Check for missing values in meal_encoded using .isna().sum()
print(df_clean["meal_encoded"].isna().sum())
# TODO: Drop the original 'meal' column using .drop()
df_clean.drop("meal", axis=1)

# TODO: Print the new dataset shape
df_clean.shape


{'SC': 0, 'Undefined': 0, 'BB': 1, 'HB': 2, 'FB': 3}
           meal  meal_encoded
1655         SC             0
3106  Undefined             0
0            BB             1
9            HB             2
7            FB             3
0


(119390, 24)

## 🔢 Task 9: One-Hot Encode Remaining Categorical Columns

### 🎯 What This Cell Does
Convert all remaining categorical columns to binary (0/1) columns.

### 🧠 Learning Concepts
- **One-Hot Encoding**: Creating binary columns for each category
- **Dummy Variables**: 0 = not this category, 1 = is this category
- **Feature Expansion**: More columns, but neural network friendly

### 📋 Example:
```
hotel          →  hotel_City Hotel  hotel_Resort Hotel
City Hotel     →         1                    0
Resort Hotel   →         0                    1
```

### 💡 Why One-Hot Encoding?
- Neural networks work better with binary features
- Prevents false ordinal relationships
- Each category gets equal "weight"

### 🚀 Run This Cell
Execute this cell to create your final dataset for neural networks!


In [140]:
# Task 9: One-hot encode remaining categorical columns

# TODO: Find remaining categorical columns where dtype == 'object'
# Store in a variable called categorical_cols
categorical_cols = df_clean.select_dtypes(include="object").columns
# Check what's ACTUALLY in df_clean right now:
print("Current df_clean columns:")
print(df_clean.columns.tolist())

# Check current categorical columns:
current_categorical = df_clean.select_dtypes(include=['object']).columns
print(f"\nCurrent categorical columns: {current_categorical.tolist()}")
# TODO: Print the categorical columns and their count
print("Current categorical columns:", categorical_cols)
print("Length", len(categorical_cols))
# TODO: Print the current dataset shape (before encoding)
print("Shape of the current daraframe:", df_clean.shape)
# TODO: Apply one-hot encoding using pd.get_dummies()
# Create df_final from df_clean, encode categorical_cols, use prefix=categorical_cols
df_final = pd.get_dummies(df_clean, columns=categorical_cols, prefix=categorical_cols)
df_final.head()
# TODO: Print the shape after one-hot encoding
print("Shape after one-hot encoding:", df_final.shape)
# TODO: Calculate and print how many new columns were created
total_new_columns = len(df_final.columns) - len(df_clean.columns) 

print("New columns created:", total_new_columns)

# TODO: Show the first 10 new column names that were created
df_final.head(10)
# TODO: Verify all columns are now numeric by checking data types
# Hint: use df_final.dtypes.value_counts()
print(df_final.dtypes.value_counts())

# TODO: Check if any object columns remain using .select_dtypes(include=['object'])
print("Columns using Object:", df_final.select_dtypes(include=["object"]))
# TODO: Print final dataset summary (shape, features, samples)
print("===== Final dataset summary =====")
print("Shape:", df_final.shape)
print("Features:", df_final.columns.tolist())
print("Samples:", df_final.sample)
# TODO: Show df_final.head() to preview the final dataset
df_final.head(10)


Current df_clean columns:
['hotel', 'is_canceled', 'arrival_date_year', 'arrival_date_month', 'arrival_date_week_number', 'arrival_date_day_of_month', 'stays_in_weekend_nights', 'adults', 'children', 'babies', 'meal', 'market_segment', 'distribution_channel', 'is_repeated_guest', 'previous_cancellations', 'previous_bookings_not_canceled', 'reserved_room_type', 'assigned_room_type', 'booking_changes', 'deposit_type', 'customer_type', 'required_car_parking_spaces', 'total_of_special_requests', 'meal_encoded']

Current categorical columns: ['hotel', 'arrival_date_month', 'meal', 'market_segment', 'distribution_channel', 'reserved_room_type', 'assigned_room_type', 'deposit_type', 'customer_type']
Current categorical columns: Index(['hotel', 'arrival_date_month', 'meal', 'market_segment',
       'distribution_channel', 'reserved_room_type', 'assigned_room_type',
       'deposit_type', 'customer_type'],
      dtype='object')
Length 9
Shape of the current daraframe: (119390, 24)
Shape after o

Unnamed: 0,is_canceled,arrival_date_year,arrival_date_week_number,arrival_date_day_of_month,stays_in_weekend_nights,adults,children,babies,is_repeated_guest,previous_cancellations,previous_bookings_not_canceled,booking_changes,required_car_parking_spaces,total_of_special_requests,meal_encoded,hotel_City Hotel,hotel_Resort Hotel,arrival_date_month_April,arrival_date_month_August,arrival_date_month_December,arrival_date_month_February,arrival_date_month_January,arrival_date_month_July,arrival_date_month_June,arrival_date_month_March,arrival_date_month_May,arrival_date_month_November,arrival_date_month_October,arrival_date_month_September,meal_BB,meal_FB,meal_HB,meal_SC,meal_Undefined,market_segment_Aviation,market_segment_Complementary,market_segment_Corporate,market_segment_Direct,market_segment_Groups,market_segment_Offline TA/TO,market_segment_Online TA,market_segment_Undefined,distribution_channel_Corporate,distribution_channel_Direct,distribution_channel_GDS,distribution_channel_TA/TO,distribution_channel_Undefined,reserved_room_type_A,reserved_room_type_B,reserved_room_type_C,reserved_room_type_D,reserved_room_type_E,reserved_room_type_F,reserved_room_type_G,reserved_room_type_H,reserved_room_type_L,reserved_room_type_P,assigned_room_type_A,assigned_room_type_B,assigned_room_type_C,assigned_room_type_D,assigned_room_type_E,assigned_room_type_F,assigned_room_type_G,assigned_room_type_H,assigned_room_type_I,assigned_room_type_K,assigned_room_type_L,assigned_room_type_P,deposit_type_No Deposit,deposit_type_Non Refund,deposit_type_Refundable,customer_type_Contract,customer_type_Group,customer_type_Transient,customer_type_Transient-Party
0,0,2015,27,1,0,2,0.0,0,0,0,0,3,0,0,1,False,True,False,False,False,False,False,True,False,False,False,False,False,False,True,False,False,False,False,False,False,False,True,False,False,False,False,False,True,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,True,False,False,False,False,True,False
1,0,2015,27,1,0,2,0.0,0,0,0,0,4,0,0,1,False,True,False,False,False,False,False,True,False,False,False,False,False,False,True,False,False,False,False,False,False,False,True,False,False,False,False,False,True,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,True,False,False,False,False,True,False
2,0,2015,27,1,0,1,0.0,0,0,0,0,0,0,0,1,False,True,False,False,False,False,False,True,False,False,False,False,False,False,True,False,False,False,False,False,False,False,True,False,False,False,False,False,True,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,True,False,False,False,False,True,False
3,0,2015,27,1,0,1,0.0,0,0,0,0,0,0,0,1,False,True,False,False,False,False,False,True,False,False,False,False,False,False,True,False,False,False,False,False,False,True,False,False,False,False,False,True,False,False,False,False,True,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,True,False
4,0,2015,27,1,0,2,0.0,0,0,0,0,0,0,1,1,False,True,False,False,False,False,False,True,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,True,False,True,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,True,False
5,0,2015,27,1,0,2,0.0,0,0,0,0,0,0,1,1,False,True,False,False,False,False,False,True,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,True,False,True,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,True,False
6,0,2015,27,1,0,2,0.0,0,0,0,0,0,0,0,1,False,True,False,False,False,False,False,True,False,False,False,False,False,False,True,False,False,False,False,False,False,False,True,False,False,False,False,False,True,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,True,False,False,False,False,True,False
7,0,2015,27,1,0,2,0.0,0,0,0,0,0,0,1,3,False,True,False,False,False,False,False,True,False,False,False,False,False,False,False,True,False,False,False,False,False,False,True,False,False,False,False,False,True,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,True,False,False,False,False,True,False
8,1,2015,27,1,0,2,0.0,0,0,0,0,0,0,1,1,False,True,False,False,False,False,False,True,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,True,False,True,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,True,False
9,1,2015,27,1,0,2,0.0,0,0,0,0,0,0,0,2,False,True,False,False,False,False,False,True,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,True,False,False,False,False,False,True,False,False,False,False,True,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,True,False,False,False,False,True,False


# 🎉 Phase 2 Complete: Data Preprocessing

## ✅ What You've Accomplished (Tasks 7-9)

### Task 7: Data Cleaning ✅
- **Removed 9 problematic columns** (data leakage, too many categories)
- **Kept 18 useful features** for model training
- **Preserved target variable** (`is_canceled`)

### Task 8: Meal Encoding ✅
- **Ordinal encoding** applied to `meal` column
- **Logical order preserved**: Undefined(0) → BB(1) → HB(2) → FB(3)
- **Meaning maintained** while making it numeric

### Task 9: One-Hot Encoding ✅
- **All categorical columns** converted to binary features
- **Neural network ready** - all numeric data
- **Feature expansion** - more columns but ML-friendly format

## 📊 Final Dataset Summary
- **Shape**: ~119,390 samples × ~50+ features
- **All Numeric**: Ready for PyTorch neural networks
- **Clean Data**: No missing values, no text columns
- **Target Preserved**: `is_canceled` column ready for training

## 🚀 Next Phase: Model Preparation
Now we'll prepare the data for PyTorch and build our neural networks!

### Coming Up (Tasks 10-13):
- **Train/Test Split**: Divide data for training and evaluation
- **Feature Scaling**: Normalize data for better training
- **PyTorch Tensors**: Convert to PyTorch format
- **Data Loaders**: Prepare for batch training

## 💡 Great Progress!
You've successfully transformed raw hotel booking data into a clean, numeric dataset ready for machine learning. The hardest part of data science is often the data preparation - and you've nailed it! 🎯


# 🚀 Phase 3: Model Preparation

## 🎯 What This Phase Does
Convert our cleaned pandas data into PyTorch tensors and prepare for neural network training.

## 🧠 Learning Concepts
- **PyTorch Tensors**: GPU-optimized arrays for deep learning
- **Train/Test Split**: Dividing data for training and evaluation
- **Feature Selection**: Choosing input variables vs target variables
- **Data Types**: Float32 for features, Long for classification targets

## 📋 Tasks 10-13: PyTorch Data Preparation
- **Task 10**: Import PyTorch libraries ⬅️ **NEXT**
- **Task 11**: Create feature lists (exclude targets)
- **Task 12**: Convert to PyTorch tensors
- **Task 13**: Split into train/test sets

## 💡 Why This Matters
Neural networks need:
- **Tensors**: Not pandas DataFrames
- **Proper data types**: Float32 for inputs, specific types for targets
- **Train/test separation**: To evaluate model performance fairly


## ⚡ Task 10: Import PyTorch Libraries

### 🎯 What This Cell Does
Import the essential PyTorch modules needed for neural network training.

### 🧠 Learning Concepts
- **torch**: Core PyTorch library for tensors and operations
- **torch.nn**: Neural network layers and loss functions
- **train_test_split**: Scikit-learn function for data splitting

### 📋 Libraries You'll Need:
- Main PyTorch library
- Neural network module from PyTorch
- Train/test split function from sklearn
- (Optional: torch.optim for optimizers)

### 💡 Research Tip
Look up the standard PyTorch import statements - most tutorials start the same way!

### 🚀 Run This Cell
Import the libraries you'll need for building neural networks.


In [141]:
# Task 10: Import PyTorch libraries

# TODO: Import the main PyTorch library (commonly aliased as 'torch')
import torch

# TODO: Import the neural network module from PyTorch
import torch.nn as nn

# TODO: Import train_test_split from sklearn.model_selection
from sklearn.model_selection import train_test_split

# TODO: (Optional) Import the optimizer module from PyTorch
import torch.optim as optim

# TODO: Print PyTorch version to verify installation
# Research: How do you check the version of an imported library?
print(torch.__version__)


2.8.0


## 🎯 Task 11: Create Feature Lists

### 🎯 What This Cell Does
Identify which columns are features (inputs) vs targets (outputs) for your models.

### 🧠 Learning Concepts
- **Features**: Input variables that help predict the outcome
- **Binary Target**: `is_canceled` (0 or 1)
- **Multiclass Target**: `reservation_status` (but we removed it!)
- **List Comprehension**: Efficient way to filter lists in Python

### 📋 What You Need to Figure Out:
- Which column is your binary classification target?
- How do you get all columns EXCEPT the target?
- What happens if a target column doesn't exist in your dataset?

### 💡 Think About:
- You have 76 total columns - how many should be features?
- Why might we need different feature lists for different models?

### 🚀 Run This Cell
Create lists of feature column names for your neural networks.


In [142]:
# Task 11: Create feature lists excluding target variables

# TODO: Create a list called 'binary_features' containing all column names EXCEPT 'is_canceled'
# Hint: You can use list comprehension or pandas methods
binary_features = df_final.drop("is_canceled", axis=1).columns.tolist()
print(binary_features)
# TODO: Print the number of features in your binary_features list
print("Number of features in binary_features:", len(binary_features))
# TODO: Print the first 10 feature names to verify they look correct

# TODO: Verify 'is_canceled' is NOT in your features list
# Research: How do you check if an item is NOT in a Python list?
print("is_canceled" not in binary_features)

# TODO: (Challenge) What would happen if you tried to create multiclass_features 
# excluding 'reservation_status'? Try it and see what happens!

# Let's try creating multiclass_features excluding 'reservation_status'
try:
    multiclass_features = df_final.drop("reservation_status", axis=1)
    print("Success! multiclass_features created with shape:", multiclass_features.shape)
    print("Number of features in multiclass_features:", len(multiclass_features.columns))
except KeyError as e:
    print(f"Error: {e}")
    print("The 'reservation_status' column doesn't exist in df_final!")
    print("Available columns:", list(df_final.columns))
    print("\nThis happens because we removed 'reservation_status' during data preprocessing.")
    print("For multiclass classification, we would need to keep this column as our target.")



['arrival_date_year', 'arrival_date_week_number', 'arrival_date_day_of_month', 'stays_in_weekend_nights', 'adults', 'children', 'babies', 'is_repeated_guest', 'previous_cancellations', 'previous_bookings_not_canceled', 'booking_changes', 'required_car_parking_spaces', 'total_of_special_requests', 'meal_encoded', 'hotel_City Hotel', 'hotel_Resort Hotel', 'arrival_date_month_April', 'arrival_date_month_August', 'arrival_date_month_December', 'arrival_date_month_February', 'arrival_date_month_January', 'arrival_date_month_July', 'arrival_date_month_June', 'arrival_date_month_March', 'arrival_date_month_May', 'arrival_date_month_November', 'arrival_date_month_October', 'arrival_date_month_September', 'meal_BB', 'meal_FB', 'meal_HB', 'meal_SC', 'meal_Undefined', 'market_segment_Aviation', 'market_segment_Complementary', 'market_segment_Corporate', 'market_segment_Direct', 'market_segment_Groups', 'market_segment_Offline TA/TO', 'market_segment_Online TA', 'market_segment_Undefined', 'distri

## 🔄 Task 12: Convert to PyTorch Tensors

### 🎯 What This Cell Does
Transform pandas DataFrames into PyTorch tensors with proper data types.

### 🧠 Learning Concepts
- **torch.tensor()**: Converts data to PyTorch format
- **dtype**: Data type specification (float32, long, etc.)
- **GPU compatibility**: Tensors can be moved to GPU later
- **Memory efficiency**: Proper data types save memory

### 📋 Key Questions to Research:
- What data type should features use? (Hint: neural networks like decimals)
- What data type should classification targets use? (Hint: categories are integers)
- How do you select specific columns from a pandas DataFrame?
- What's the difference between .values and .to_numpy()?

### 💡 Think About:
- Why do features need float32 but targets need long/int64?
- What happens if you use the wrong data type?

### 🚀 Run This Cell
Convert your pandas data to PyTorch tensors ready for neural networks.


In [143]:
# Task 12: Create PyTorch tensors with proper data types
binary_features = df_final.drop("is_canceled", axis=1).columns.tolist()  # List of column names
# TODO: Create X tensor from df_final using binary_features columns
# Research: What dtype should features use for neural networks?
# Convert everything to float32 first, then create tensor
X_data = df_final[binary_features].astype('float32')
X = torch.tensor(X_data.values, dtype=torch.float32)


# TODO: Create y tensor from df_final using 'is_canceled' column  
# Research: What dtype should classification targets use?
y = torch.tensor(df_final["is_canceled"].values, dtype=torch.long)


# TODO: Print the shapes of X and y tensors
print(f"X shape: {X.shape}")
print(f"y shape: {y.shape}")
# TODO: Print the data types of X and y tensors
print(f"X dtype: {X.dtype}")
print(f"y dtype: {y.dtype}")


# TODO: Print first 5 rows of X to verify the data looks correct
print(X[:5])
# TODO: Print first 10 values of y to verify the target values
print(y[:10])


X shape: torch.Size([119390, 75])
y shape: torch.Size([119390])
X dtype: torch.float32
y dtype: torch.int64
tensor([[2.0150e+03, 2.7000e+01, 1.0000e+00, 0.0000e+00, 2.0000e+00, 0.0000e+00,
         0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 3.0000e+00, 0.0000e+00,
         0.0000e+00, 1.0000e+00, 0.0000e+00, 1.0000e+00, 0.0000e+00, 0.0000e+00,
         0.0000e+00, 0.0000e+00, 0.0000e+00, 1.0000e+00, 0.0000e+00, 0.0000e+00,
         0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 1.0000e+00, 0.0000e+00,
         0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
         1.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
         1.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
         1.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
         0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 1.0000e+00, 0.0000e+00,
         0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
 

## ✂️ Task 13: Train/Test Split

### 🎯 What This Cell Does
Divide your data into training and testing sets for unbiased model evaluation.

### 🧠 Learning Concepts
- **Training Set**: Data used to train the neural network (80%)
- **Testing Set**: Data used to evaluate final performance (20%)
- **Random State**: Ensures reproducible results
- **Stratification**: Maintains class balance in splits (optional)

### 📋 Key Concepts to Research:
- Why do we split data before training?
- What does random_state parameter do?
- What's the standard train/test split ratio?
- How does train_test_split handle both X and y simultaneously?

### 💡 Think About:
- Why is it important to split BEFORE looking at the data?
- What would happen if you used all data for training?
- Should the split be 80/20, 70/30, or 90/10?

### 🚀 Run This Cell
Split your tensors into training and testing sets for model development.


In [144]:
# Task 13: Split data into training and testing sets
from sklearn.preprocessing import StandardScaler


# TODO: Use train_test_split to create X_train, X_test, y_train, y_test
# Research: What parameters does train_test_split need?
# Consider: test_size, random_state

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# TODO: Print the shapes of all four resulting datasets
print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)



# TODO: Calculate and print the percentage split to verify it's correct
if X_train.shape[0] == len(X) - (len(X) * 0.2):

    train_pct = len(X_train) / len(X) * 100
    test_pct = len(X_test) / len(X) * 100
print(f"Train: {train_pct:.1f}%, Test: {test_pct:.1f}%")

# TODO: Check the class distribution in y_train and y_test
# Research: How can you count values in PyTorch tensors?
print("Number of y_train that belongs to class 0:", (y_train == 0).sum().item())
print("Number of y_train that belongs to class 1:", (y_train == 1).sum().item())

# Create scaler and fit on training data only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train.numpy())
X_test_scaled = scaler.transform(X_test.numpy())  # Use same scaling

# Convert back to tensors
X_train_scaled = torch.tensor(X_train_scaled, dtype=torch.float32)
X_test_scaled = torch.tensor(X_test_scaled, dtype=torch.float32)

# Verify scaling worked
print("Before scaling:")
print(f"X_train range: {X_train.min().item():.2f} to {X_train.max().item():.2f}")
print("After scaling:")
print(f"X_train_scaled range: {X_train_scaled.min().item():.2f} to {X_train_scaled.max().item():.2f}")
print(f"Mean: {X_train_scaled.mean().item():.4f}, Std: {X_train_scaled.std().item():.4f}")


# TODO: Print summary of your prepared data:
# - Total samples, features, train size, test size
print("Total features:", X.shape[1])
print("Total samples for X_train:", len(X_train))
print("Total samples for X_test:", len(X_test))
print("Number of y_test that belongs to class 0:", (y_test == 0).sum().item())
print("Number of y_test that belongs to class 1:", (y_test == 1).sum().item())


Shape of X_train: torch.Size([95512, 75])
Shape of X_test: torch.Size([23878, 75])
Shape of y_train: torch.Size([95512])
Shape of y_test: torch.Size([23878])
Train: 80.0%, Test: 20.0%
Number of y_train that belongs to class 0: 60259
Number of y_train that belongs to class 1: 35253
Before scaling:
X_train range: nan to nan
After scaling:
X_train_scaled range: nan to nan
Mean: nan, Std: nan
Total features: 75
Total samples for X_train: 95512
Total samples for X_test: 23878
Number of y_test that belongs to class 0: 14907
Number of y_test that belongs to class 1: 8971


In [145]:
# Just fix the existing tensors
X_train = torch.nan_to_num(X_train, nan=0.0)
X_test = torch.nan_to_num(X_test, nan=0.0)

print("Quick fix! X_train has NaN:", torch.isnan(X_train).any().item())

Quick fix! X_train has NaN: False


# 🧠 Phase 4: Binary Classification Neural Network

## 🎯 What This Phase Does
Build and train your first neural network to predict hotel booking cancellations (binary classification).

## 🧠 Learning Concepts
- **Neural Network Architecture**: Input → Hidden → Output layers
- **Forward Propagation**: How data flows through the network
- **Backpropagation**: How the network learns from errors
- **Loss Functions**: Binary Cross-Entropy for classification
- **Optimizers**: Adam for efficient gradient descent
- **Training Loop**: Epochs, batches, and convergence

## 📋 Tasks 14-18: Neural Network Development
- **Task 14**: Build network architecture ⬅️ **NEXT**
- **Task 15**: Define loss function and optimizer
- **Task 16**: Train the model with tracking
- **Task 17**: Evaluate on test set
- **Task 18**: Calculate performance metrics

## 💡 Why This Matters
This is where the magic happens:
- **Pattern Recognition**: Network learns to identify cancellation patterns
- **Feature Relationships**: Discovers complex interactions between variables
- **Predictive Power**: Transforms data into actionable business insights


## 🏗️ Task 14: Build Neural Network Architecture

### 🎯 What This Cell Does
Design and create a multi-layer neural network class for binary classification.

### 🧠 Learning Concepts
- **nn.Module**: PyTorch's base class for neural networks
- **Linear Layers**: Fully connected layers (Dense layers)
- **Activation Functions**: ReLU for hidden layers, Sigmoid for output
- **Layer Sizes**: Input(75) → Hidden1(36) → Hidden2(18) → Output(1)

### 📋 Architecture Design Questions:
- How do you inherit from nn.Module?
- What's the difference between `__init__` and `forward` methods?
- Why use ReLU activation in hidden layers?
- Why use Sigmoid activation for binary classification output?
- How do layer sizes relate to your feature count?

### 💡 Think About:
- Why do layer sizes decrease (75 → 36 → 18 → 1)?
- What happens if you use too many or too few layers?
- How does the network learn complex patterns?

### 🚀 Run This Cell
Create your first neural network class ready for training!


In [146]:
# Task 14: Build neural network architecture

# TODO: Create a class that inherits from nn.Module
# Research: What's the standard pattern for PyTorch neural network classes?
class HotelCancellationNet(nn.Module):
    def __init__(self):
        super().__init__()
# TODO: In __init__, define your layers:
# - Layer 1: Input (75) → Hidden (36) 
# - Layer 2: Hidden (36) → Hidden (18)
# - Layer 3: Hidden (18) → Output (1)
# Research: What's nn.Linear? How do you specify input and output sizes?
        self.layer1 = nn.Linear(75, 36)
        self.layer2 = nn.Linear(36, 18)
        self.layer3 = nn.Linear(18, 1)
# TODO: In __init__, define activation functions
# Research: What's nn.ReLU()? What's nn.Sigmoid()?
        self.relu = nn.ReLU()
    
    def forward(self, x):
    # Layer 1: Input → Hidden1 → ReLU
        x = self.layer1(x)
        x = self.relu(x)
        
        # Layer 2: Hidden1 → Hidden2 → ReLU  
        x = self.layer2(x)
        x = self.relu(x)
        
        # Layer 3: Hidden2 → Output logits
        x = self.layer3(x)
        
        return x


# TODO: Implement the forward method
# Research: How does data flow through layers and activations?
# Pattern: layer → activation → layer → activation → final_layer → final_activation

# TODO: Create an instance of your network
# Test: Print the network to see its structure
model = HotelCancellationNet()
print(model)




HotelCancellationNet(
  (layer1): Linear(in_features=75, out_features=36, bias=True)
  (layer2): Linear(in_features=36, out_features=18, bias=True)
  (layer3): Linear(in_features=18, out_features=1, bias=True)
  (relu): ReLU()
)


## ⚙️ Task 15: Define Loss Function and Optimizer

### 🎯 What This Cell Does
Set up the learning components: how to measure errors and how to improve.

### 🧠 Learning Concepts
- **Loss Function**: Measures how wrong the predictions are
- **Binary Cross-Entropy**: Perfect for binary classification problems
- **Optimizer**: Algorithm that adjusts weights to reduce loss
- **Adam**: Adaptive optimizer that works well for most problems
- **Learning Rate**: How big steps the optimizer takes

### 📋 Key Research Questions:
- What's the difference between loss functions for classification vs regression?
- Why is Binary Cross-Entropy ideal for binary classification?
- What does an optimizer actually optimize?
- How does learning rate affect training speed and stability?
- What are Adam's advantages over basic gradient descent?

### 💡 Think About:
- What happens if learning rate is too high or too low?
- Why do we need to specify model parameters in the optimizer?
- How does the loss function connect to backpropagation?

### 🚀 Run This Cell
Set up the learning engine for your neural network!


In [147]:
# Task 15: Define loss function and optimizer

# TODO: Define the loss function for binary classification
# Research: What's nn.BCELoss()? Why is it perfect for binary classification?
# Alternative: Look up nn.BCEWithLogitsLoss() - what's the difference?
loss_function = nn.BCEWithLogitsLoss()

# TODO: Define the optimizer using Adam
# Research: What's torch.optim.Adam()? What parameters does it need?
# Consider: model.parameters(), learning rate (try 0.001 or 0.01)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# TODO: Print information about your setup
# Show: Loss function type, optimizer type, learning rate, parameter count
# TODO: Print information about your setup
# Show: Loss function type, optimizer type, learning rate, parameter count

print("=== Neural Network Setup Information ===")

# 1. Model architecture
print(f"Model: {model}")
print(f"Model class: {model.__class__.__name__}")

# 2. Loss function info
print(f"\nLoss Function: {loss_function}")
print(f"Loss type: {loss_function.__class__.__name__}")

# 3. Optimizer details
print(f"\nOptimizer: {optimizer}")
print(f"Optimizer type: {optimizer.__class__.__name__}")

# 4. Learning rate
print(f"Learning rate: {optimizer.param_groups[0]['lr']}")

# 5. Parameter count (very useful!)
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"\nTotal parameters: {total_params:,}")
print(f"Trainable parameters: {trainable_params:,}")

# 6. Model layers breakdown
print(f"\nModel Architecture Details:")
for name, layer in model.named_modules():
    if len(list(layer.children())) == 0:  # Only leaf modules
        print(f"  {name}: {layer}")
# TODO: (Optional) Research different learning rates
# Experiment: What happens with lr=0.1, 0.01, 0.001, 0.0001?
logits = model(X_train_scaled)
probs = torch.sigmoid(logits)
print(f"Prediction range (probabilities): {probs.min().item():.4f} to {probs.max().item():.4f}")


=== Neural Network Setup Information ===
Model: HotelCancellationNet(
  (layer1): Linear(in_features=75, out_features=36, bias=True)
  (layer2): Linear(in_features=36, out_features=18, bias=True)
  (layer3): Linear(in_features=18, out_features=1, bias=True)
  (relu): ReLU()
)
Model class: HotelCancellationNet

Loss Function: BCEWithLogitsLoss()
Loss type: BCEWithLogitsLoss

Optimizer: Adam (
Parameter Group 0
    amsgrad: False
    betas: (0.9, 0.999)
    capturable: False
    decoupled_weight_decay: False
    differentiable: False
    eps: 1e-08
    foreach: None
    fused: None
    lr: 0.001
    maximize: False
    weight_decay: 0
)
Optimizer type: Adam
Learning rate: 0.001

Total parameters: 3,421
Trainable parameters: 3,421

Model Architecture Details:
  layer1: Linear(in_features=75, out_features=36, bias=True)
  layer2: Linear(in_features=36, out_features=18, bias=True)
  layer3: Linear(in_features=18, out_features=1, bias=True)
  relu: ReLU()
Prediction range (probabilities): na

## 🔄 Task 16: Train the Neural Network

### 🎯 What This Cell Does
Implement the training loop where your network learns from the hotel booking data.

### 🧠 Learning Concepts
- **Training Loop**: The heart of machine learning
- **Epochs**: Complete passes through the training data
- **Forward Pass**: Network makes predictions
- **Loss Calculation**: Compare predictions to actual results
- **Backward Pass**: Calculate gradients (backpropagation)
- **Parameter Update**: Optimizer adjusts weights
- **Progress Tracking**: Monitor loss and accuracy over time

### 📋 Training Loop Research:
- What's the standard PyTorch training loop pattern?
- Why do we call `optimizer.zero_grad()`?
- What does `loss.backward()` actually do?
- When do we call `optimizer.step()`?
- How do you track training progress?

### 💡 Think About:
- How many epochs should you train for?
- What does decreasing loss indicate?
- How do you know when training is complete?
- What's overfitting and how do you detect it?

### 🚀 Run This Cell
Watch your neural network learn to predict cancellations!


In [148]:
# Task 16: Train the neural network

# TODO: Set up training parameters
# Decide: How many epochs? (Try 100-1000 for experimentation)
# Consider: How often to print progress? (Every 100 epochs?)
train_loss = []
train_accuracies = []
epochs = 1000

# TODO: Create lists to track training progress
# Track: Loss values, accuracy values (for plotting later)
for epoch in range(epochs):
    train_logits = model(X_train_scaled).squeeze()
    loss = loss_function(train_logits, y_train.float())

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    with torch.no_grad():
        train_probs = torch.sigmoid(train_logits)
        predicted_classes = torch.round(train_probs)
        accuracy = (predicted_classes == y_train.float()).float().mean() * 100
        accuracy_value = accuracy.item()

    train_loss.append(loss.item())
    train_accuracies.append(accuracy_value)

    if (epoch + 1) % 100 == 0:
        print(f"Epoch {epoch+1}/{epochs}")
        print(f"Loss: {loss.item():.4f}")
        print(f"Accuracy: {accuracy_value:.2f}%")
        print("-" * 30)

# Check what your model is actually predicting
with torch.no_grad():
    sample_logits = model(X_train_scaled[:100])
    sample_probs = torch.sigmoid(sample_logits)
    print("Sample predictions (probabilities):")
    print("Min:", sample_probs.min().item())
    print("Max:", sample_probs.max().item())
    print("Mean:", sample_probs.mean().item())
    print("Unique values:", torch.unique(torch.round(sample_probs)).shape[0])


# TODO: Calculate training accuracy during training
# Research: How do you convert probabilities to predictions?
# Hint: Sigmoid output > 0.5 = class 1, otherwise class 0

# TODO: Print progress periodically
# Show: Epoch number, loss value, accuracy percentage



Epoch 100/1000
Loss: nan
Accuracy: 0.00%
------------------------------
Epoch 200/1000
Loss: nan
Accuracy: 0.00%
------------------------------
Epoch 300/1000
Loss: nan
Accuracy: 0.00%
------------------------------
Epoch 400/1000
Loss: nan
Accuracy: 0.00%
------------------------------
Epoch 500/1000
Loss: nan
Accuracy: 0.00%
------------------------------
Epoch 600/1000
Loss: nan
Accuracy: 0.00%
------------------------------
Epoch 700/1000
Loss: nan
Accuracy: 0.00%
------------------------------
Epoch 800/1000
Loss: nan
Accuracy: 0.00%
------------------------------
Epoch 900/1000
Loss: nan
Accuracy: 0.00%
------------------------------
Epoch 1000/1000
Loss: nan
Accuracy: 0.00%
------------------------------
Sample predictions (probabilities):
Min: nan
Max: nan
Mean: nan
Unique values: 100


## 🔧 **CRITICAL FIX: Feature Scaling Applied**

### **Problem Identified:**
- **Raw features** had values in thousands (e.g., `arrival_date_year`: 2015-2017)
- **Network weights** initialized to ~[-0.1, 0.1]
- **Result**: Massive first layer outputs → Sigmoid saturation → All predictions = 1.0

### **Solution Implemented:**
- ✅ **StandardScaler**: Normalized all features to mean=0, std=1
- ✅ **Updated all code**: Now uses `X_train_scaled` and `X_test_scaled`
- ✅ **BCEWithLogitsLoss**: More numerically stable loss function
- ✅ **Proper gradient flow**: Network can now learn effectively

### **Expected Results:**
- **Before**: Loss stuck at 63.09, Accuracy stuck at 36.91%
- **After**: Loss should decrease, Accuracy should improve to 70-85%

---


## 📊 Task 17: Evaluate Model Performance

### 🎯 What This Cell Does
Test your trained neural network on unseen data to measure real-world performance.

### 🧠 Learning Concepts
- **Model Evaluation**: Testing on data the model has never seen
- **Test Set**: Unbiased performance measurement
- **Inference Mode**: Disable training-specific behaviors
- **Prediction Conversion**: Probabilities → Binary predictions
- **Performance Baseline**: How good is "good enough"?

### 📋 Evaluation Research:
- Why do we evaluate on test data, not training data?
- What's `model.eval()` and why do we need it?
- How do you convert sigmoid outputs to class predictions?
- What's the difference between training and inference?
- How do you interpret accuracy percentages?

### 💡 Think About:
- What accuracy would you expect by random guessing?
- Is 70% accuracy good for this problem?
- How does test accuracy compare to training accuracy?
- What does overfitting look like in the results?

### 🚀 Run This Cell
Discover how well your neural network predicts hotel cancellations!


In [135]:
# Task 17: Evaluate model on test set

# TODO: Set model to evaluation mode
# Research: What does model.eval() do? Why is it important?
model.eval()

# TODO: Make predictions on test set
# Research: How do you disable gradient computation during inference?
# Hint: Look up torch.no_grad() context manager
with torch.no_grad():
    test_logits = model(X_test_scaled)  # Critical: Use SCALED test data!
    test_probs = torch.sigmoid(test_logits)
    predicted_classes = torch.round(test_probs.squeeze())
    test_accuracy = (predicted_classes == y_test.float()).float().mean() * 100
    test_loss = loss_function(test_logits.squeeze(), y_test.float())

# TODO: Convert predictions to binary classes
# Logic: Sigmoid output > 0.5 = class 1, otherwise class 0
# Research: How do you apply threshold to tensor values?

# TODO: Calculate test accuracy
# Formula: Correct predictions / Total predictions
# Research: How do you count matching values in PyTorch tensors?

# TODO: Calculate test loss for comparison
# Compare: How does test loss compare to final training loss?

# TODO: Print comprehensive results
# Show: Test accuracy, test loss, number of correct predictions
# Analysis: Is this better than random guessing (50%)?
# Results
print("🎯 Test Set Evaluation Results:")
print(f"Test Accuracy: {test_accuracy:.2f}%")
print(f"Test Loss: {test_loss.item():.4f}")
if train_accuracies:
    print(f"Training Accuracy: {train_accuracies[-1]:.2f}% (for comparison)")
else:
    print("Training Accuracy: N/A (run training first)")
if train_loss:
    print(f"Training Loss: {train_loss[-1]:.4f} (for comparison)")
else:
    print("Training Loss: N/A (run training first)")



🎯 Test Set Evaluation Results:
Test Accuracy: 0.00%
Test Loss: nan
Training Accuracy: 0.00% (for comparison)
Training Loss: nan (for comparison)


## 🎯 Task 18: Advanced Performance Metrics

### 🎯 What This Cell Does
Calculate comprehensive metrics to deeply understand your model's strengths and weaknesses.

### 🧠 Learning Concepts
- **Confusion Matrix**: True/False Positives and Negatives
- **Precision**: Of predicted cancellations, how many were correct?
- **Recall**: Of actual cancellations, how many did we catch?
- **F1-Score**: Balance between precision and recall
- **Business Impact**: What do these metrics mean for hotels?

### 📋 Metrics Research:
- What's the difference between accuracy and precision?
- When is high recall more important than high precision?
- What does F1-score tell you that accuracy doesn't?
- How do you interpret a confusion matrix?
- Which metric matters most for business decisions?

### 💡 Business Context:
- **False Positive**: Predict cancellation, but guest shows up
- **False Negative**: Miss a cancellation, lose revenue opportunity  
- **Hotel Perspective**: Which error is more costly?

### 🚀 Run This Cell
Get professional-grade insights into your model's performance!


In [81]:
# Task 18: Calculate comprehensive performance metrics
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
# TODO: Import sklearn metrics for professional evaluation
# Research: What's sklearn.metrics? Which functions do you need?
# Consider: accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

# Convert PyTorch tensors to numpy arrays for sklearn compatibility
y_test_np = y_test.detach().numpy()
y_test_np = y_test_np.ravel()
# Get predictions from the model
with torch.no_grad():
    y_pred_logits = model(X_test_scaled)
    y_pred_probs = torch.sigmoid(y_pred_logits)
    y_pred = (y_pred_probs > 0.5).float()  # Convert probabilities to binary predictions

y_pred_np = y_pred.detach().numpy()
y_pred_np = y_pred_np.ravel()


# Calculate all classification metrics
accuracy = accuracy_score(y_test_np, y_pred_np)
precision = precision_score(y_test_np, y_pred_np)
recall = recall_score(y_test_np, y_pred_np)
f1 = f1_score(y_test_np, y_pred_np)

print("🎯 Model Performance Metrics:")
print(f"Accuracy:  {accuracy:.4f} ({accuracy*100:.2f}%)")
print(f"Precision: {precision:.4f} ({precision*100:.2f}%)")
print(f"Recall:    {recall:.4f} ({recall*100:.2f}%)")
print(f"F1-Score:  {f1:.4f} ({f1*100:.2f}%)")

# TODO: Create and display confusion matrix
# Research: How do you interpret a 2x2 confusion matrix?
# Format: [[True Neg, False Pos], [False Neg, True Pos]]
cm = confusion_matrix(y_test_np, y_pred_np)
print("\n📊 Confusion Matrix:")
print("    Predicted")
print("    0    1")
print(f"0  {cm[0,0]:4d} {cm[0,1]:4d}  Actual")
print(f"1  {cm[1,0]:4d} {cm[1,1]:4d}")
print("\nInterpretation:")
print(f"True Negatives:  {cm[0,0]} (Correctly predicted no cancellation)")
print(f"False Positives: {cm[0,1]} (Incorrectly predicted cancellation)")
print(f"False Negatives: {cm[1,0]} (Missed actual cancellations)")
print(f"True Positives:  {cm[1,1]} (Correctly predicted cancellation)")

# Business-relevant insights and metric importance analysis
print("\n💼 Business Impact Analysis:")
print("=" * 50)

# Calculate cost implications
false_positive_rate = cm[0,1] / (cm[0,0] + cm[0,1])
false_negative_rate = cm[1,0] / (cm[1,0] + cm[1,1])

print(f"False Positive Rate: {false_positive_rate:.4f} ({false_positive_rate*100:.2f}%)")
print(f"False Negative Rate: {false_negative_rate:.4f} ({false_negative_rate*100:.2f}%)")

print("\n🎯 Which Metric Matters Most for Hotels?")
print("=" * 45)
print("📈 RECALL is typically most critical because:")
print("   • Missing cancellations (False Negatives) = Lost revenue opportunity")
print("   • Hotels can't resell rooms last-minute")
print("   • Overbooking strategies depend on accurate cancellation prediction")
print(f"   • Current Recall: {recall:.4f} - We're catching {recall*100:.2f}% of cancellations")

print("\n⚖️ PRECISION matters for operational efficiency:")
print("   • False alarms (False Positives) = Unnecessary overbooking")
print("   • Can lead to guest dissatisfaction if overbooked")
print(f"   • Current Precision: {precision:.4f} - {precision*100:.2f}% of predictions are correct")

print("\n🏆 RECOMMENDED PRIMARY METRIC: F1-Score")
print(f"   • Balances both recall and precision: {f1:.4f}")
print("   • Prevents over-optimization of one metric at expense of other")
print("   • Best for business decision-making in hospitality industry")

# Calculate revenue impact estimates
total_cancellations = cm[1,0] + cm[1,1]
missed_cancellations = cm[1,0]
avg_room_rate = 100  # Assumption for demonstration

print(f"\n💰 Estimated Revenue Impact:")
print(f"   • Missed {missed_cancellations} out of {total_cancellations} cancellations")
print(f"   • Potential lost revenue: ${missed_cancellations * avg_room_rate:,}")
print(f"   • Model effectiveness: {(1-false_negative_rate)*100:.1f}% of cancellations caught")

# TODO: Print a professional performance report
# Include: All metrics, confusion matrix, business interpretation
# Compare: How does this compare to industry benchmarks?


🎯 Model Performance Metrics:
Accuracy:  0.3757 (37.57%)
Precision: 0.3757 (37.57%)
Recall:    1.0000 (100.00%)
F1-Score:  0.5462 (54.62%)

📊 Confusion Matrix:
    Predicted
    0    1
0     0 14907  Actual
1     0 8971

Interpretation:
True Negatives:  0 (Correctly predicted no cancellation)
False Positives: 14907 (Incorrectly predicted cancellation)
False Negatives: 0 (Missed actual cancellations)
True Positives:  8971 (Correctly predicted cancellation)

💼 Business Impact Analysis:
False Positive Rate: 1.0000 (100.00%)
False Negative Rate: 0.0000 (0.00%)

🎯 Which Metric Matters Most for Hotels?
📈 RECALL is typically most critical because:
   • Missing cancellations (False Negatives) = Lost revenue opportunity
   • Hotels can't resell rooms last-minute
   • Overbooking strategies depend on accurate cancellation prediction
   • Current Recall: 1.0000 - We're catching 100.00% of cancellations

⚖️ PRECISION matters for operational efficiency:
   • False alarms (False Positives) = Unnecess