# 🏨 Hotel Cancellation Prediction Project

## 🎯 Learning Objectives
Welcome to your neural network journey! This project will teach you:

### Core Concepts You'll Master
- **Binary Classification**: Predicting a yes/no outcome (will the booking be canceled?)
- **Multiclass Classification**: Predicting categories (Check-out/Canceled/No-show)
- **PyTorch Fundamentals**: Building and training neural networks
- **Real-world ML**: Using actual hotel booking data

## 📚 Neural Network Fundamentals

### Architecture Components
- **Layers**: Input → Hidden → Output layers
- **Activation Functions**: ReLU, Sigmoid, Softmax
- **Loss Functions**: Binary Cross-Entropy, Cross-Entropy
- **Optimizers**: Adam optimizer for training
- **Evaluation**: Accuracy, Precision, Recall, F1-Score
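
To make the components above concrete, here is a minimal pure-Python sketch of the two output activations, binary cross-entropy, and precision/recall/F1. The function names and the toy numbers are illustrative assumptions, not part of the project code; PyTorch provides optimized versions of all of these.

```python
import math

def sigmoid(z):
    """Squash a real number into (0, 1); typical for a binary output."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(logits):
    """Turn a list of scores into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(z - largest) for z in logits]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Average binary cross-entropy over 0/1 labels and predicted probabilities."""
    losses = []
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep log() away from 0
        losses.append(-(y * math.log(p) + (1 - y) * math.log(1 - p)))
    return sum(losses) / len(losses)

def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy example: three raw model outputs (logits) and their true labels
labels = [0, 1, 1]
probs = [sigmoid(z) for z in (-2.0, 0.0, 3.0)]
preds = [1 if p >= 0.5 else 0 for p in probs]  # threshold at 0.5
loss = binary_cross_entropy(labels, probs)
metrics = precision_recall_f1(labels, preds)
print(preds, round(loss, 3), metrics)
```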

## 🚀 Project Structure Overview

### Phase Breakdown
- **Phase 1**: Data Exploration (Tasks 1-6) ✅ **COMPLETED**
- **Phase 2**: Data Preprocessing (Tasks 7-9) 🚧 **IN PROGRESS**  
- **Phase 3**: Model Preparation (Tasks 10-13)
- **Phase 4**: Binary Classification (Tasks 14-18)
- **Phase 5**: Multiclass Classification (Tasks 19-26)

---

## 📊 Dataset Information

### Dataset Details
- **Source**: Kaggle - Hotel Booking Demand Dataset
- **Size**: 119,390 hotel bookings
- **Features**: 32 columns including dates, demographics, pricing
- **Target**: Predict if bookings will be canceled


## 📥 Dataset Download Function

### 🎯 What This Cell Does
This cell creates a reusable function to download Kaggle datasets and organize them in your project.

### 🧠 Learning Concepts
- **Functions**: Reusable code blocks
- **File Management**: Creating folders and copying files
- **Error Handling**: Avoiding a crash when the folder already exists with `os.makedirs(path, exist_ok=True)`
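
As a small illustration of the `exist_ok` behavior, the sketch below creates a folder twice inside a temporary directory. The temporary directory stands in for your project root and is an assumption made so the example leaves no files behind.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as project_root:
    data_folder = os.path.join(project_root, "data")

    os.makedirs(data_folder, exist_ok=True)  # creates the folder
    os.makedirs(data_folder, exist_ok=True)  # second call is a harmless no-op
    created = os.path.isdir(data_folder)

    try:
        os.makedirs(data_folder)  # default exist_ok=False
        raised = False
    except FileExistsError:
        raised = True  # without exist_ok=True an existing folder is an error

print("Folder created:", created)
print("Error without exist_ok:", raised)
```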

### 🚀 Run This Cell
Execute this cell to set up your dataset download function.


In [5]:
import kagglehub
import os
import shutil

def download_kaggle_dataset(dataset_name, project_root=None):
    """
    Download a Kaggle dataset and copy it to the project's data folder.
    
    Args:
        dataset_name (str): Kaggle dataset name in format 'username/dataset-name'
        project_root (str): Path to project root. If None, uses current directory.
    
    Returns:
        str: Path to the local data folder
    """
    if project_root is None:
        project_root = os.getcwd()
    
    # Create data folder if it doesn't exist
    data_folder = os.path.join(project_root, 'data')
    os.makedirs(data_folder, exist_ok=True)
    
    # Download dataset from Kaggle
    print(f"Downloading dataset: {dataset_name}")
    kaggle_path = kagglehub.dataset_download(dataset_name)
    print(f"Kaggle download path: {kaggle_path}")
    
    # Copy files to local data folder
    for file in os.listdir(kaggle_path):
        src = os.path.join(kaggle_path, file)
        dst = os.path.join(data_folder, file)
        if os.path.isfile(src):
            shutil.copy2(src, dst)
            print(f"Copied: {file}")
    
    print(f"Dataset files copied to: {data_folder}")
    return data_folder

# Download the hotel booking dataset
data_path = download_kaggle_dataset("jessemostipak/hotel-booking-demand")


Downloading dataset: jessemostipak/hotel-booking-demand
Kaggle download path: /Users/franciscoteixeirabarbosa/.cache/kagglehub/datasets/jessemostipak/hotel-booking-demand/versions/1
Copied: hotel_bookings.csv
Dataset files copied to: /Users/franciscoteixeirabarbosa/Dropbox/Random_scripts/predict_hotel_cancellations/data


## 📊 Load and Explore Dataset

### 📋 Task 1: Import and Inspect
- **Goal**: Load the CSV file into a pandas DataFrame
- **Method**: Use `pd.read_csv()` to read the file
- **Output**: See the first 5 rows with `.head()`

### 🚀 Run This Cell
Execute this cell to load your dataset and see what you're working with!


In [6]:
import pandas as pd

# Load the dataset from local data folder
df = pd.read_csv("data/hotel_bookings.csv")

print(f"Dataset shape: {df.shape}")
print(f"Columns: {list(df.columns)}")

# Set pandas display options to show all columns
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)

print("\n=== Full DataFrame Preview (all columns visible) ===")
df.head()


Dataset shape: (119390, 32)
Columns: ['hotel', 'is_canceled', 'lead_time', 'arrival_date_year', 'arrival_date_month', 'arrival_date_week_number', 'arrival_date_day_of_month', 'stays_in_weekend_nights', 'stays_in_week_nights', 'adults', 'children', 'babies', 'meal', 'country', 'market_segment', 'distribution_channel', 'is_repeated_guest', 'previous_cancellations', 'previous_bookings_not_canceled', 'reserved_room_type', 'assigned_room_type', 'booking_changes', 'deposit_type', 'agent', 'company', 'days_in_waiting_list', 'customer_type', 'adr', 'required_car_parking_spaces', 'total_of_special_requests', 'reservation_status', 'reservation_status_date']

=== Full DataFrame Preview (all columns visible) ===


Unnamed: 0,hotel,is_canceled,lead_time,arrival_date_year,arrival_date_month,arrival_date_week_number,arrival_date_day_of_month,stays_in_weekend_nights,stays_in_week_nights,adults,children,babies,meal,country,market_segment,distribution_channel,is_repeated_guest,previous_cancellations,previous_bookings_not_canceled,reserved_room_type,assigned_room_type,booking_changes,deposit_type,agent,company,days_in_waiting_list,customer_type,adr,required_car_parking_spaces,total_of_special_requests,reservation_status,reservation_status_date
0,Resort Hotel,0,342,2015,July,27,1,0,0,2,0.0,0,BB,PRT,Direct,Direct,0,0,0,C,C,3,No Deposit,,,0,Transient,0.0,0,0,Check-Out,2015-07-01
1,Resort Hotel,0,737,2015,July,27,1,0,0,2,0.0,0,BB,PRT,Direct,Direct,0,0,0,C,C,4,No Deposit,,,0,Transient,0.0,0,0,Check-Out,2015-07-01
2,Resort Hotel,0,7,2015,July,27,1,0,1,1,0.0,0,BB,GBR,Direct,Direct,0,0,0,A,C,0,No Deposit,,,0,Transient,75.0,0,0,Check-Out,2015-07-02
3,Resort Hotel,0,13,2015,July,27,1,0,1,1,0.0,0,BB,GBR,Corporate,Corporate,0,0,0,A,A,0,No Deposit,304.0,,0,Transient,75.0,0,0,Check-Out,2015-07-02
4,Resort Hotel,0,14,2015,July,27,1,0,2,2,0.0,0,BB,GBR,Online TA,TA/TO,0,0,0,A,A,0,No Deposit,240.0,,0,Transient,98.0,0,1,Check-Out,2015-07-03


# 🎉 Phase 1 Complete: Data Exploration Summary

## ✅ What You've Accomplished (Tasks 1-6)

### Task 1-2: Data Loading & Inspection ✅
- Loaded 119,390 hotel bookings with 32 features
- Identified data types and missing values
- Found key columns: `is_canceled`, `reservation_status`, `hotel`, etc.

### Task 3-4: Target Analysis ✅  
- **Cancellation Rate**: ~37% overall (44,224 canceled / 75,166 not canceled)
- **Hotel Type Impact**: City hotels (41.7%) vs Resort hotels (27.8%)
- **Three Outcomes**: Check-Out, Canceled, No-Show

### Task 5: Seasonal Patterns ✅
- **Lowest**: January (30.5%), November (31.2%), March (32.2%)
- **Highest**: June (41.5%), April (40.8%), May (39.7%)
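
Rates like the ones above are group means of the 0/1 `is_canceled` column. The sketch below shows the pattern on a tiny made-up DataFrame; the five rows are invented for illustration, so the resulting numbers are not the real dataset's rates.

```python
import pandas as pd

# Made-up rows that mimic three columns of the hotel dataset
toy = pd.DataFrame({
    "hotel": ["City Hotel", "City Hotel", "City Hotel", "Resort Hotel", "Resort Hotel"],
    "arrival_date_month": ["June", "June", "January", "January", "June"],
    "is_canceled": [1, 0, 1, 0, 0],
})

# The mean of a 0/1 column is the share of 1s, i.e. the cancellation rate
overall_rate = toy["is_canceled"].mean()
rate_by_hotel = toy.groupby("hotel")["is_canceled"].mean()
rate_by_month = toy.groupby("arrival_date_month")["is_canceled"].mean().sort_values()

print(f"Overall: {overall_rate:.1%}")
print(rate_by_hotel)
print(rate_by_month)
```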

### Task 6: Feature Selection ✅
- **18 Recommended Features** for model training
- **9 Problematic Features** removed (data leakage - range = 1.0)
- **12 Categorical Columns** identified for encoding

---

# 🧹 Phase 2: Data Preprocessing

## 🎯 What This Phase Does
Transform raw data into a format that neural networks can understand.

## 🧠 Learning Concepts
- **Feature Engineering**: Creating useful inputs
- **Encoding**: Converting text to numbers
- **Data Cleaning**: Removing irrelevant information
- **One-Hot Encoding**: Creating binary columns for categories

## 📋 Tasks 7-9: Data Cleaning and Preparation
- **Task 7**: Drop irrelevant columns ⬅️ **NEXT**
- **Task 8**: Encode meal types
- **Task 9**: Apply one-hot encoding

## 💡 Why This Matters
Neural networks need:
- **Numbers only**: No text allowed
- **No missing values**: Complete data required
- **Relevant features**: Only useful information
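
One way to check the first two requirements is to list non-numeric columns and count missing values. This is a sketch on a made-up three-row DataFrame; on the real data you would run the same calls on your own DataFrame.

```python
import pandas as pd

# Made-up sample with one text column and one missing value
toy = pd.DataFrame({
    "adults": [2, 1, 2],
    "children": [0.0, None, 1.0],
    "meal": ["BB", "HB", "BB"],
})

# Columns that are neither numeric nor boolean still need encoding
text_columns = toy.select_dtypes(exclude=["number", "bool"]).columns.tolist()

# Count missing values per column and in total
missing_per_column = toy.isna().sum()
total_missing = int(missing_per_column.sum())

print("Needs encoding:", text_columns)
print("Missing values per column:")
print(missing_per_column)
```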


## 🗑️ Task 7: Drop Irrelevant Columns

### 🎯 What This Cell Does
Remove columns that would cause data leakage or are not useful for prediction.

### 🧠 Learning Concepts
- **Data Leakage**: Features that "cheat" by using future information
- **Feature Selection**: Keeping only relevant predictors
- **Too Many Categories**: Columns with excessive unique values

### 📋 Columns to Remove:
**Data Leakage (range = 1.0)**:
- `reservation_status` - This IS the multiclass target, and it also reveals `is_canceled` directly
- `reservation_status_date` - Contains outcome information

**Too Many Categories**:
- `country` - 178 unique values (too sparse)
- `agent` - Hundreds of unique IDs, plus many missing values
- `company` - Hundreds of unique IDs, plus many missing values

**Perfect Predictors (suspicious)**:
- `lead_time`, `stays_in_week_nights`, `days_in_waiting_list`, `adr`
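
A quick way to spot a leaking column is to check whether each of its values maps to exactly one target value. The sketch below does that on made-up rows and also counts unique values per column; the toy data is an assumption for illustration only.

```python
import pandas as pd

# Made-up rows: reservation_status fully determines is_canceled
toy = pd.DataFrame({
    "reservation_status": ["Check-Out", "Canceled", "No-Show", "Check-Out", "Canceled"],
    "is_canceled": [0, 1, 1, 0, 1],
    "country": ["PRT", "GBR", "PRT", "ESP", "FRA"],
})

# If every status value has exactly one target value, the column gives the answer away
targets_per_status = toy.groupby("reservation_status")["is_canceled"].nunique()
leaks_target = bool((targets_per_status == 1).all())

# High-cardinality check: number of distinct values per categorical column
cardinality = toy[["reservation_status", "country"]].nunique()

print("reservation_status leaks the target:", leaks_target)
print(cardinality)
```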

### 🚀 Run This Cell
Execute this cell to clean your dataset for model training!


In [25]:
# Task 7: Drop irrelevant columns for model training

# TODO: Create a list called 'columns_to_drop' with these 9 columns:
# Data leakage columns: 'reservation_status', 'reservation_status_date'
# Too many categories: 'country', 'agent', 'company'  
# Perfect predictors: 'lead_time', 'stays_in_week_nights', 'days_in_waiting_list', 'adr'


columns_to_drop = [
    'reservation_status', 'reservation_status_date',
    'country', 'agent', 'company',
    'lead_time', 'stays_in_week_nights', 'days_in_waiting_list', 'adr',
]

# TODO: Print the original dataset shape using df.shape

print("=== DF.SHAPE BEFORE ===")
print(df.shape)

# TODO: Print the columns_to_drop list and its length
print("List of columns to drop:", columns_to_drop)
print("Length of columns to drop", len(columns_to_drop))

# TODO: Create df_clean by dropping the columns from df using .drop()
df_clean = df.drop(columns=columns_to_drop)

# TODO: Print the cleaned dataset shape
print("=== DF.SHAPE AFTER ===")
print(df_clean.shape)
# TODO: Print the remaining column names
print(df_clean.columns)

# TODO: Verify 'is_canceled' is still present in df_clean.columns

# Check column membership; .any() would only test whether any booking was canceled
if "is_canceled" in df_clean.columns:
    print("is_canceled column is still present")
else:
    print("Not present")

# TODO: Find remaining categorical columns (dtype == 'object') and print them
categorical_columns = df_clean.select_dtypes(include=['object']).columns
print(categorical_columns)


=== DF.SHAPE BEFORE ===
(119390, 32)
List of columns to drop: ['reservation_status', 'reservation_status_date', 'country', 'agent', 'company', 'lead_time', 'stays_in_week_nights', 'days_in_waiting_list', 'adr']
Length of columns to drop 9
=== DF.SHAPE AFTER ===
(119390, 23)
Index(['hotel', 'is_canceled', 'arrival_date_year', 'arrival_date_month',
       'arrival_date_week_number', 'arrival_date_day_of_month',
       'stays_in_weekend_nights', 'adults', 'children', 'babies', 'meal',
       'market_segment', 'distribution_channel', 'is_repeated_guest',
       'previous_cancellations', 'previous_bookings_not_canceled',
       'reserved_room_type', 'assigned_room_type', 'booking_changes',
       'deposit_type', 'customer_type', 'required_car_parking_spaces',
       'total_of_special_requests'],
      dtype='object')
is_canceled column is still present
Index(['hotel', 'arrival_date_month', 'meal', 'market_segment',
       'distribution_channel', 'reserved_room_type', 'assigned_room_type',
       'deposit_type', 'customer_type'],
      dtype='object')

## 🍽️ Task 8: Label Encode Meal Column

### 🎯 What This Cell Does
Convert the `meal` column to numbers while preserving the logical order.

### 🧠 Learning Concepts
- **Ordinal Encoding**: When categories have a natural order
- **Label Encoding**: Converting text to numbers
- **Preserving Meaning**: Keeping logical relationships

### 📋 Meal Types (in order):
- **Undefined/SC** → 0 (No meal or self-catering)
- **BB** → 1 (Bed & Breakfast)
- **HB** → 2 (Half Board - breakfast + dinner)
- **FB** → 3 (Full Board - all meals)

### 💡 Why This Order?
More meals = higher service level = potentially different cancellation behavior
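
One thing to watch with a dictionary mapping: `Series.map` returns `NaN` for any value that is missing from the dictionary. The sketch below uses a made-up meal code (`"AI"`, not a value from the dataset) to show how to detect that.

```python
import pandas as pd

meal_mapping = {"SC": 0, "Undefined": 0, "BB": 1, "HB": 2, "FB": 3}

# "AI" is a hypothetical code that the mapping does not cover
toy_meals = pd.Series(["BB", "FB", "SC", "AI"])
encoded = toy_meals.map(meal_mapping)

# Values that were not in the mapping become NaN, so list them explicitly
unmapped = toy_meals[encoded.isna()].unique().tolist()

print(encoded.tolist())
print("Unmapped values:", unmapped)
```

If `unmapped` is not empty on your data, extend the mapping before dropping the original column.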

### 🚀 Run This Cell
Execute this cell to encode the meal column!


In [38]:
# Task 8: Label encode meal column with meaningful order

# TODO: Print the current meal values using df_clean['meal'].value_counts()
df_clean["meal"].value_counts()

# TODO: Create a meal_mapping dictionary with this order:
# 'Undefined': 0, 'SC': 0, 'BB': 1, 'HB': 2, 'FB': 3
meal_mapping = {
    "SC": 0,
    "Undefined": 0,
    "BB": 1,
    "HB": 2,
    "FB": 3
}

# TODO: Print the meal_mapping dictionary
print(meal_mapping)

# TODO: Create a new column 'meal_encoded' by applying the mapping to 'meal' column
# Hint: use df_clean['meal'].map(meal_mapping)
df_clean["meal_encoded"] = df_clean["meal"].map(meal_mapping)

# TODO: Verify the encoding by showing unique meal/meal_encoded pairs
# Hint: use df_clean[['meal', 'meal_encoded']].drop_duplicates().sort_values('meal_encoded')
print(df_clean[['meal', 'meal_encoded']].drop_duplicates().sort_values('meal_encoded'))
# TODO: Check for missing values in meal_encoded using .isna().sum()
print(df_clean["meal_encoded"].isna().sum())
# TODO: Drop the original 'meal' column using .drop()
# .drop() returns a new DataFrame, so assign the result back to df_clean
df_clean = df_clean.drop(columns="meal")

# TODO: Print the new dataset shape
print(df_clean.shape)


{'SC': 0, 'Undefined': 0, 'BB': 1, 'HB': 2, 'FB': 3}
           meal  meal_encoded
1655         SC             0
3106  Undefined             0
0            BB             1
9            HB             2
7            FB             3
0
(119390, 23)

## 🔢 Task 9: One-Hot Encode Remaining Categorical Columns

### 🎯 What This Cell Does
Convert all remaining categorical columns to binary (0/1) columns.

### 🧠 Learning Concepts
- **One-Hot Encoding**: Creating binary columns for each category
- **Dummy Variables**: 0 = not this category, 1 = is this category
- **Feature Expansion**: More columns, but neural network friendly

### 📋 Example:
```
hotel          →  hotel_City Hotel  hotel_Resort Hotel
City Hotel     →         1                    0
Resort Hotel   →         0                    1
```

### 💡 Why One-Hot Encoding?
- Neural networks work better with binary features
- Prevents false ordinal relationships
- Each category gets equal "weight"
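
The table above can be reproduced with `pd.get_dummies` on a tiny made-up DataFrame. Passing `dtype=int` is an optional choice made here so the new columns hold 0/1 integers; recent pandas versions produce `True`/`False` columns by default.

```python
import pandas as pd

# Made-up sample with one categorical and one numeric column
toy = pd.DataFrame({
    "hotel": ["City Hotel", "Resort Hotel", "City Hotel"],
    "adults": [2, 1, 2],
})

# One new 0/1 column per category; the original 'hotel' column is removed
encoded = pd.get_dummies(toy, columns=["hotel"], prefix=["hotel"], dtype=int)

print(encoded.columns.tolist())
print(encoded)
```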

### 🚀 Run This Cell
Execute this cell to create your final dataset for neural networks!


In [None]:
# Task 9: One-hot encode remaining categorical columns

# TODO: Find remaining categorical columns where dtype == 'object'
# Store in a variable called categorical_cols
categorical_cols = df_clean.select_dtypes(include="object").columns
# Check what's ACTUALLY in df_clean right now:
print("Current df_clean columns:")
print(df_clean.columns.tolist())

# Check current categorical columns:
current_categorical = df_clean.select_dtypes(include=['object']).columns
print(f"\nCurrent categorical columns: {current_categorical.tolist()}")
# TODO: Print the categorical columns and their count
print("Current categorical columns:", categorical_cols)
print("Length", len(categorical_cols))
# TODO: Print the current dataset shape (before encoding)
print("Shape of the current dataframe:", df_clean.shape)
# TODO: Apply one-hot encoding using pd.get_dummies()
# Create df_final from df_clean, encode categorical_cols, use prefix=categorical_cols
df_final = pd.get_dummies(df_clean, columns=categorical_cols, prefix=categorical_cols)
# TODO: Print the shape after one-hot encoding
print("Shape after one-hot encoding:", df_final.shape)
# TODO: Calculate and print how many new columns were created
# New columns are those that exist in df_final but not in df_clean
new_columns = [col for col in df_final.columns if col not in df_clean.columns]
print("New columns created:", len(new_columns))

# TODO: Show the first 10 new column names that were created
print(new_columns[:10])
# TODO: Verify all columns are now numeric by checking data types
# Hint: use df_final.dtypes.value_counts()
print(df_final.dtypes.value_counts())

# TODO: Check if any object columns remain using .select_dtypes(include=['object'])
# Print only the column names (an empty list means none remain)
print("Columns using Object:", df_final.select_dtypes(include=["object"]).columns.tolist())
# TODO: Print final dataset summary (shape, features, samples)
print("===== Final dataset summary =====")
print("Shape:", df_final.shape)
print("Features:", df_final.columns.tolist())
# len() gives the row count; df_final.sample is a method, not a number
print("Samples:", len(df_final))
# TODO: Show df_final.head() to preview the final dataset
df_final.head(10)


Current df_clean columns:
['hotel', 'is_canceled', 'arrival_date_year', 'arrival_date_month', 'arrival_date_week_number', 'arrival_date_day_of_month', 'stays_in_weekend_nights', 'adults', 'children', 'babies', 'market_segment', 'distribution_channel', 'is_repeated_guest', 'previous_cancellations', 'previous_bookings_not_canceled', 'reserved_room_type', 'assigned_room_type', 'booking_changes', 'deposit_type', 'customer_type', 'required_car_parking_spaces', 'total_of_special_requests', 'meal_encoded']

Current categorical columns: ['hotel', 'arrival_date_month', 'market_segment', 'distribution_channel', 'reserved_room_type', 'assigned_room_type', 'deposit_type', 'customer_type']
Current categorical columns: Index(['hotel', 'arrival_date_month', 'market_segment',
       'distribution_channel', 'reserved_room_type', 'assigned_room_type',
       'deposit_type', 'customer_type'],
      dtype='object')
Length 8
Shape of the current dataframe: (119390, 23)
Shape after o

# 🎉 Phase 2 Complete: Data Preprocessing

## ✅ What You've Accomplished (Tasks 7-9)

### Task 7: Data Cleaning ✅
- **Removed 9 problematic columns** (data leakage, too many categories)
- **Kept 22 candidate features** plus the target (23 columns remain after the drop)
- **Preserved target variable** (`is_canceled`)

### Task 8: Meal Encoding ✅
- **Ordinal encoding** applied to `meal` column
- **Logical order preserved**: Undefined/SC(0) → BB(1) → HB(2) → FB(3)
- **Meaning maintained** while making it numeric

### Task 9: One-Hot Encoding ✅
- **All categorical columns** converted to binary features
- **Neural network ready** - all numeric data
- **Feature expansion** - more columns but ML-friendly format

## 📊 Final Dataset Summary
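- **Shape**: 119,390 samples × several dozen features (run `df_final.shape` for the exact count)
- **All Numeric**: Ready for PyTorch neural networks
- **Clean Data**: No text columns; confirm there are no missing values with `df_final.isna().sum().sum()` before training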
- **Target Preserved**: `is_canceled` column ready for training

## 🚀 Next Phase: Model Preparation
Now we'll prepare the data for PyTorch and build our neural networks!

### Coming Up (Tasks 10-13):
- **Train/Test Split**: Divide data for training and evaluation
- **Feature Scaling**: Normalize data for better training
- **PyTorch Tensors**: Convert to PyTorch format
- **Data Loaders**: Prepare for batch training

## 💡 Great Progress!
You've successfully transformed raw hotel booking data into a clean, numeric dataset ready for machine learning. The hardest part of data science is often the data preparation - and you've nailed it! 🎯


# 🚀 Phase 3: Model Preparation

## 🎯 What This Phase Does
Convert our cleaned pandas data into PyTorch tensors and prepare for neural network training.

## 🧠 Learning Concepts
- **PyTorch Tensors**: GPU-optimized arrays for deep learning
- **Train/Test Split**: Dividing data for training and evaluation
- **Feature Selection**: Choosing input variables vs target variables
- **Data Types**: Float32 for features; Long (int64) class indices for `nn.CrossEntropyLoss`, Float32 targets for binary cross-entropy losses

## 📋 Tasks 10-13: PyTorch Data Preparation
- **Task 10**: Import PyTorch libraries ⬅️ **NEXT**
- **Task 11**: Create feature lists (exclude targets)
- **Task 12**: Convert to PyTorch tensors
- **Task 13**: Split into train/test sets

## 💡 Why This Matters
Neural networks need:
- **Tensors**: Not pandas DataFrames
- **Proper data types**: Float32 for inputs, specific types for targets
- **Train/test separation**: To evaluate model performance fairly


## ⚡ Task 10: Import PyTorch Libraries

### 🎯 What This Cell Does
Import the essential PyTorch modules needed for neural network training.

### 🧠 Learning Concepts
- **torch**: Core PyTorch library for tensors and operations
- **torch.nn**: Neural network layers and loss functions
- **train_test_split**: Scikit-learn function for data splitting

### 📋 Libraries You'll Need:
- Main PyTorch library
- Neural network module from PyTorch
- Train/test split function from sklearn
- (Optional: torch.optim for optimizers)

### 💡 Research Tip
Look up the standard PyTorch import statements - most tutorials start the same way!

### 🚀 Run This Cell
Import the libraries you'll need for building neural networks.


In [None]:
# Task 10: Import PyTorch libraries

# TODO: Import the main PyTorch library (commonly aliased as 'torch')

# TODO: Import the neural network module from PyTorch

# TODO: Import train_test_split from sklearn.model_selection

# TODO: (Optional) Import the optimizer module from PyTorch

# TODO: Print PyTorch version to verify installation
# Research: How do you check the version of an imported library?


## 🎯 Task 11: Create Feature Lists

### 🎯 What This Cell Does
Identify which columns are features (inputs) vs targets (outputs) for your models.

### 🧠 Learning Concepts
- **Features**: Input variables that help predict the outcome
- **Binary Target**: `is_canceled` (0 or 1)
- **Multiclass Target**: `reservation_status` (but we removed it!)
- **List Comprehension**: Efficient way to filter lists in Python

### 📋 What You Need to Figure Out:
- Which column is your binary classification target?
- How do you get all columns EXCEPT the target?
- What happens if a target column doesn't exist in your dataset?

### 💡 Think About:
- Check `df_final.shape[1]` for the total column count - how many of those should be features?
- Why might we need different feature lists for different models?
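
The filtering pattern can be tried on a plain Python list first. The column names below are a made-up subset used only for illustration; the same comprehension works on `df_final.columns`.

```python
# Hypothetical subset of column names, with the target mixed in
columns = ["is_canceled", "adults", "children", "meal_encoded", "hotel_City Hotel"]
target = "is_canceled"

# Keep every column except the target
features = [col for col in columns if col != target]

# Filtering on a name that is not in the list removes nothing and raises no error
missing_target = "reservation_status"
features_multi = [col for col in columns if col != missing_target]

print(len(features), "features:", features)
print("Target excluded:", target not in features)
print("Nothing removed for a missing column:", features_multi == columns)
```

Note the contrast with `DataFrame.drop(columns=...)`, which raises a `KeyError` for a missing column unless you pass `errors="ignore"`.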

### 🚀 Run This Cell
Create lists of feature column names for your neural networks.


In [None]:
# Task 11: Create feature lists excluding target variables

# TODO: Create a list called 'binary_features' containing all column names EXCEPT 'is_canceled'
# Hint: You can use list comprehension or pandas methods

# TODO: Print the number of features in your binary_features list

# TODO: Print the first 10 feature names to verify they look correct

# TODO: Verify 'is_canceled' is NOT in your features list
# Research: How do you check if an item is NOT in a Python list?

# TODO: (Challenge) What would happen if you tried to create multiclass_features 
# excluding 'reservation_status'? Try it and see what happens!


## 🔄 Task 12: Convert to PyTorch Tensors

### 🎯 What This Cell Does
Transform pandas DataFrames into PyTorch tensors with proper data types.

### 🧠 Learning Concepts
- **torch.tensor()**: Converts data to PyTorch format
- **dtype**: Data type specification (float32, long, etc.)
- **GPU compatibility**: Tensors can be moved to GPU later
- **Memory efficiency**: Proper data types save memory

### 📋 Key Questions to Research:
- What data type should features use? (Hint: neural networks like decimals)
- What data type should classification targets use? (Hint: categories are integers)
- How do you select specific columns from a pandas DataFrame?
- What's the difference between .values and .to_numpy()?

### 💡 Think About:
- Why do features need float32 but targets need long/int64?
- What happens if you use the wrong data type?
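
Here is a sketch of the conversion on a made-up four-row DataFrame, not the project solution. It relies on two PyTorch conventions: `nn.CrossEntropyLoss` expects integer (long) class indices, while `nn.BCEWithLogitsLoss` expects float targets with the same shape as the model output.

```python
import pandas as pd
import torch

# Made-up numeric data standing in for the encoded dataset
toy = pd.DataFrame({
    "adults": [2, 1, 2, 3],
    "is_repeated_guest": [0, 1, 0, 0],
    "is_canceled": [0, 1, 1, 0],
})
feature_cols = [col for col in toy.columns if col != "is_canceled"]

# Features: cast to float32 while converting to NumPy, then wrap in a tensor
X = torch.tensor(toy[feature_cols].to_numpy(dtype="float32"))

# Targets as long class indices (the form nn.CrossEntropyLoss expects)
y_long = torch.tensor(toy["is_canceled"].to_numpy(), dtype=torch.long)

# Targets as a float column vector (the form nn.BCEWithLogitsLoss expects)
y_float = y_long.float().unsqueeze(1)

# Count how many samples fall into each class
class_counts = torch.bincount(y_long)

print(X.shape, X.dtype)
print(y_long.shape, y_long.dtype)
print(y_float.shape, y_float.dtype)
print("Class counts:", class_counts.tolist())
```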

### 🚀 Run This Cell
Convert your pandas data to PyTorch tensors ready for neural networks.


In [None]:
# Task 12: Create PyTorch tensors with proper data types

# TODO: Create X tensor from df_final using binary_features columns
# Research: What dtype should features use for neural networks?

# TODO: Create y tensor from df_final using 'is_canceled' column  
# Research: What dtype should classification targets use?

# TODO: Print the shapes of X and y tensors

# TODO: Print the data types of X and y tensors

# TODO: Print first 5 rows of X to verify the data looks correct

# TODO: Print first 10 values of y to verify the target values


## ✂️ Task 13: Train/Test Split

### 🎯 What This Cell Does
Divide your data into training and testing sets for unbiased model evaluation.

### 🧠 Learning Concepts
- **Training Set**: Data used to train the neural network (80%)
- **Testing Set**: Data used to evaluate final performance (20%)
- **Random State**: Ensures reproducible results
- **Stratification**: Maintains class balance in splits (optional)

### 📋 Key Concepts to Research:
- Why do we split data before training?
- What does random_state parameter do?
- What's the standard train/test split ratio?
- How does train_test_split handle both X and y simultaneously?

### 💡 Think About:
- Why is it important to split BEFORE fitting scalers or tuning the model?
- What would happen if you used all data for training?
- Should the split be 80/20, 70/30, or 90/10?
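
To see what the split parameters do, here is a sketch on made-up NumPy arrays (50 samples, 40% positive class). The sizes and `random_state=42` are arbitrary choices for illustration; `stratify=y` keeps the class ratio the same in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 50 samples with 2 features, 30 of class 0 and 20 of class 1
X = np.arange(100, dtype="float32").reshape(50, 2)
y = np.array([0] * 30 + [1] * 20)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,     # 20% of the samples go to the test set
    random_state=42,   # fixed seed so the split is reproducible
    stratify=y,        # preserve the class ratio in both sets
)

print("Train:", X_train.shape, "Test:", X_test.shape)
print("Positive share in train:", y_train.mean())
print("Positive share in test:", y_test.mean())
```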

### 🚀 Run This Cell
Split your tensors into training and testing sets for model development.


In [None]:
# Task 13: Split data into training and testing sets

# TODO: Use train_test_split to create X_train, X_test, y_train, y_test
# Research: What parameters does train_test_split need?
# Consider: test_size, random_state

# TODO: Print the shapes of all four resulting datasets

# TODO: Calculate and print the percentage split to verify it's correct

# TODO: Check the class distribution in y_train and y_test
# Research: How can you count values in PyTorch tensors?

# TODO: Print summary of your prepared data:
# - Total samples, features, train size, test size
